Saturday, August 19, 2006

Studies Prove

Thomas Sowell recently wrote a series of articles entitled “Studies Prove” (I, II and III). He gives examples, some from personal experience, of how stakeholders selectively use data that bolsters their theories and suppress other data that doesn’t. A few salient points:
It was a valuable experience so early in my career to learn that what "studies prove" is often whatever those who did the studies wanted to prove. ... it is a terminal case of naivete to put statistical studies under the control of the same government agencies whose policies are being studied.

Nor will it do any good to let those agencies farm out these studies to "independent" researchers in academia or think tanks because they will obviously farm them out to people whose track record virtually guarantees that they will reach the conclusions that the agency wants.
In part III, he discusses a study that “proved” the effectiveness of affirmative action policies at universities. However, the study authors would not release their raw data for scrutiny by others, including the distinguished Harvard Professor Stephan Thernstrom, who has conducted some famous studies of his own. Prof. Sowell tells us of a similar experience he had:
Back in the 1970s, I tried to get statistical data from Harvard to test various claims about affirmative action. Derek Bok was then president of Harvard and he was the soul of graciousness, even praising a book on economics that I had written. But, in the end, I did not get to see one statistic.

During the same era I was also researching academically successful black schools. I flew across the country to try to get data on one school, talked with board of education officials, jumped through bureaucratic hoops -- and, after all this was done and the dust settled, I still did not get to see one statistic.

Why not? Think about it. Education officials have developed explanations for why they cannot educate black children. For me to write something publicizing outstanding academic results in this particular black school would be to open a political can of worms, leading people to ask why the other schools can't do the same.

Education bureaucrats decided to keep that can sealed.

Critics of affirmative action have long said that mismatching black students with colleges that they do not qualify for creates wholly needless academic failures among these students, who drop out or flunk out of colleges that they should never have been in, when most of them are fully qualified to succeed in other colleges.

Has the ending of preferential admissions in the University of California system and the University of Texas system led to a rise in the graduation rates of black students, as critics predicted? Who knows? These universities will not release those statistics. [Emphasis added]
One of the recurring themes of my posts is the plea to make as much data public as possible. For example, state boards of education and state colleges and universities have a wealth of data on how prospective teacher candidates perform on their licensure exams. Examination of this data could help explain why some states can set cut-scores 30 points higher (on a 100 point test) than others. But since this data might also be embarrassing as well as revealing, it is not available.

When I was soliciting data from the Educational Testing Service (ETS) for my investigations, it was made very clear that I could not have any disaggregated state level data. This restriction was a contractual obligation ETS had with the individual states that contracted for ETS’s services. Otherwise, ETS could hardly have been more gracious or cooperative.

Since policy advocacy often taints research, I have been interested to read the studies of an outfit that claims to be “non-aligned and non-partisan” — The Nonpartisan Education Review:
We are allied with neither side. We have no vested interest. Unlike the many allied education pundits and researchers who call themselves independent, we actually are. And, we prove it by criticizing both sides, though probably not nearly as much as they deserve.

The Nonpartisan Education Review’s purpose is to provide education information the public can trust.
One of their reports, which discussed how states cheat on student proficiency tests, was featured in my post History Lesson.

I found this article by Richard Phelps of particular interest. It serves as an introduction to the caveats of educational research. It begins:
Over a decade ago, I was assigned a research project on educational testing at my day job. I read the most prominent research literature on the topic, and I believed what I read. Then, I devoted almost two years to intense study of one subtopic. The project was blessed with ample resources. In the end, it revealed that the prevailing wisdom was not only wrong but the exact opposite of reality.
He then exhibits a long list of claims, all of which are “either wrong or grossly misleading”.

So perhaps it shouldn’t have come as a big surprise to me that when states and the federal government want there to be an ample supply of “highly qualified” math and science teachers, their data will show that, abracadabra, they pop into existence, whether they really exist or not.

Thursday, August 10, 2006

The Highly Qualified Physical Science Teacher

What content knowledge is needed to be an effective science teacher? I began pondering this question when, by a quirk in NJ standards, I was required to take content knowledge tests in both physics and chemistry. Until 2004, NJ did not have separate chemistry and physics certifications. They only had physical science certification. This required knowledge of both chemistry and physics.

If one had trained to be such a combined physics/chemistry teacher then there would be no problem. However, NJ gets a substantial fraction of these science teachers through its alternate route programs. Typically such an alternate route candidate would have a background in chemistry or physics, but not both. Such was my case.

I know physics but not chemistry. I have advanced degrees in physics. My chemistry background consists of high school chemistry and one course in physical chemistry as a college freshman. That was more than 30 years ago. I have not had much contact with chemistry since. I have had no organic chem and no biochem — both of which are on the Praxis II test that NJ uses. In my opinion, I do not have the content knowledge necessary to teach high school chemistry (nor would I meet the current NJ requirement of at least 15 college credits).

If you have been following my earlier posts, you can guess how NJ got the physics people to pass chemistry and the chemistry people to pass physics. It just set very low standards. To earn physical science certification, NJ required three tests. For physics, they used the one-hour Praxis II (10261) test of physics content knowledge; for chemistry, the one-hour Praxis II (20241) test of chemistry content knowledge. [They also required a Praxis II test in General Science (10431) that includes biology.] The pre-2004 NJ cut-scores were a 119 for chemistry (19% of the scaled points possible) and a 113 for physics (13% of the scaled points possible).
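For readers who want to check the percentages in parentheses, here is a minimal sketch of the arithmetic; it assumes the Praxis II scaled-score range of 100 to 200, which is how the percent-of-scale figures above were computed.

```python
# A back-of-the-envelope sketch of "percent of scaled points possible",
# assuming the Praxis II scaled-score range of 100 to 200.

def percent_of_scale(cut_score, low=100, high=200):
    """Return a cut-score as a percentage of the scaled points possible."""
    return 100.0 * (cut_score - low) / (high - low)

for subject, cut in [("chemistry (pre-2004 NJ)", 119), ("physics (pre-2004 NJ)", 113)]:
    print(f"{subject}: cut-score {cut} = {percent_of_scale(cut):.0f}% of scaled points")
# chemistry (pre-2004 NJ): cut-score 119 = 19% of scaled points
# physics (pre-2004 NJ): cut-score 113 = 13% of scaled points
```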

How low these scores are was put into perspective for me by my own performance. I surpassed the chemistry cutoff by more than 60 points. This was my moment of enlightenment. Something was seriously wrong if my level of chemistry knowledge was more than 4 times the “highly qualified” minimum.

A majority of states use the two-hour versions of the Praxis II content knowledge tests (10265 and 20245). In chemistry, the cut-scores run from a high of 158 (Delaware) to a low of 135 (South Dakota). The 85th percentile score is 184. Assuming the one- and two-hour tests are comparable, it is comforting to know that at least 15% of the chemistry teachers know more chemistry than me. Cut-scores for individual states can be found on the ETS website (here) or at each state’s Department of Education website.

In physics the high cut-score is 149 (Indiana), the low 126 (West Virginia). The 85th percentile score is a 177. Delaware sets a cut-score of 112 on the one-hour physics test, a truly abysmal standard. Utah requires these tests but sets no cut-score. My guess is that Utah will eventually set cut-scores at whatever level gives it an adequate supply of teachers. Objective standards, objective standards, we don’t need no stinkin’ objective standards.

Further analysis of these results is problematic. The Education Trust did not review the content of these exams, so what follows is entirely my own opinion. On the previously discussed math Praxis II, I thought a high score (above 180) was solid evidence of mastery. The physical science tests simply do not have content challenging enough for me to reach a similar conclusion. To score highly on the physics test, one only needed rote knowledge of a few formulas. Few of the questions tested concepts. One could have a high score and still have a poor conceptual understanding of the subject. Similarly, in chemistry I would not claim mastery of either rote knowledge or concepts and yet I had a high score.

Prior to NCLB and its “highly qualified” provisions, the minimal ability definition was a do no harm standard:
In all professional licensure assessments, minimum competency is referred to as the point where a candidate will “do no harm” in practicing the profession.
The post-NCLB era uses loftier language, “highly qualified,” but hasn’t actually raised the standards. In my opinion, on these tests, scores below 160 fail the “no harm” standard. Essentially these teachers have failed (scored below 60%) on a test of fairly rudimentary rote subject knowledge. I suspect these low-scoring prospective teachers would also struggle on the SAT II or AP tests, yet we are saying they are “highly qualified” to help prepare our children to take these exams.

You should not have to take my word for it. It would be nice if old versions of these tests passed into the public domain. Without this level of transparency, the level of these tests remains largely hidden from public scrutiny. You can get some idea from the sample tests I linked to, but to really understand you need to see all the questions on a real test.

Before closing this topic, an appeal to anyone who can clarify the situation in Delaware. Delaware sets the lowest standard in physics, a 112 (on the one-hour test). They set the highest standard in chemistry, a 158 (on the two-hour test). They are transitioning their chemistry test from the one-hour to the two-hour version. On the one-hour test their cut-score was a very low 127. Is the two-hour test much easier than the one-hour test? If not, I do not understand these vastly different standards. Is DuPont laying off chemists, thereby providing a surplus of potential teachers? Please leave a comment if you know something.

Friday, August 04, 2006

Save the Data

In my previous posts I’ve presented evidence for how much (or really how little) mathematics our secondary math teachers need to know to be anointed “highly qualified”. The Reader’s Digest version: On the Praxis II test, a test whose content is at the advanced high school level, teachers can gain “highly qualified” status even if they miss 50% of the questions. In some states the miss rate climbs above 70%, all the way to 80%. If this test were graded like a typical high school exam, about 4 out of 5 of the prospective teachers would fail.

In this post I will look at a related question: “How Much Math Should Math Teachers Know?”; that is, what evidence is there for a correlation between teacher math knowledge and student math achievement? I touched on this topic briefly in my previous post. Let’s look at some details.

The bottom line here is that we don't know. The research is largely uninformative. In a 2001 review of research entitled “Teacher Preparation Research: Current Knowledge, Gaps and Recommendations”, Wilson et al. state:
We reviewed no research that directly assessed prospective teachers’ subject matter knowledge and then evaluated the relationship between teacher subject matter preparation and student learning.
They reviewed no such studies, because no large-scale studies of this type existed. An opportunity was missed with the TIMSS study. In a previous post, I wondered why the TIMSS study didn’t also test the teachers. Such a study could have been quite informative. If it showed a significant difference in subject matter knowledge between U.S. teachers and teachers from countries with superior student results, then teacher preparation should get more attention. If not, then we can primarily look elsewhere for solutions. Both the magnitude of any differences in teacher knowledge and its possible correlation with student achievement would be of interest. When a very small study was done of Chinese versus U.S. elementary teachers, huge differences were found.

Studies of the effect of teachers’ math knowledge use indirect proxies for that knowledge. The typical proxies used in these studies are based on the teacher’s exposure to college-level math: for example, whether they had a math major or minor, or simply how many college math courses they took. It was plausible that math majors would be better at high school level math than others. If so, this would be a reasonable proxy.

The data says something different. My analysis of teacher testing results revealed the surprising fact that math and math education majors do not exhibit mastery of high school level math. Nor do they do any better than other technical majors on the Praxis II. That means the proxies are poor. The minimal or non-existent correlation shown by the studies Wilson reviewed is therefore entirely consistent with my teacher testing data, even if a strong correlation exists between teacher math mastery and student achievement.

Wilson makes similar observations:
The research that does exist is limited and, in some cases, the results are contradictory. The conclusions of these few studies are provocative because they undermine the certainty often expressed about the strong link between college study of a subject matter area and teacher quality. ...

But, contrary to the popular belief that more study of subject matter (e.g., through an academic major) is always better, there is some indication from research that teachers do acquire subject matter knowledge from various sources, including subject-specific academic coursework (some kinds of subject-specific methods courses accomplish the goal). There is little definitive research on this question. Much more research needs to be done before strong conclusions can be drawn on the kinds or amount of subject matter preparation that best equip prospective teachers for classroom practice.

Some researchers have found serious problems with the typical subject matter knowledge of preservice teachers, even of those who have completed majors in academic disciplines. In mathematics, for example, while preservice teachers’ knowledge of procedures and rules may be sound, their reasoning skills and knowledge of concepts is often weak. Lacking full understanding of fundamental aspects of the subject matter impedes good teaching, especially given the high standards called for in current reforms. Research suggests that changes in teachers’ subject matter preparation may be needed, and that the solution is more complicated than simply requiring a major or more subject matter courses. [emphasis added]
Requiring a math or math education major, as some states do, is no guarantee of mathematical mastery. There is no control over the quality of the courses, or the reliability of the grades. There is no quantitative measure of how much was learned. Even if there was, it is debatable to what extent exposure to college level course work correlates with mastery of high school level math. (In my study, math majors had a mean score that was essentially at the minimal ability level. This level is almost 40 points, on a 100 point scale, below what I would call mastery.) Teacher licensure tests could provide a more reliable direct measurement of that mastery.

Without clear and convincing evidence, the interpretation of studies is subject to confirmation bias:
Confirmation bias refers to a type of selective thinking whereby one tends to notice and to look for what confirms one's beliefs, and to ignore, not look for, or undervalue the relevance of what contradicts one's beliefs.
Every human being operates with both knowledge and beliefs. However, we sometimes mistake our beliefs for knowledge.

I believe that a deep, grade-relevant understanding of mathematics is essential to great mathematics teaching. I don’t think you need a math major. I do believe you also need some knowledge of how to teach, of how to control a class, of how to manage a classroom, of how to assess a student, and of how to deal with parents and administrators. I believe it takes years to acquire the necessary math skill. I believe it would take only weeks to acquire the other skill set, at least the part that can be taught in a classroom, if it were efficiently organized and if you already have decent people skills. There were great math teachers before there were schools of education, but I have yet to meet a great math teacher who doesn't know math. It also helps to have good textbooks and a rational curriculum.

As a scientist, I am willing to change my beliefs when presented with data. The relevant experiments are becoming easier to do, if only the data were preserved and made publicly accessible. A lot of educational research reminds me of the man who looks for his lost keys under the lamp post because the light is better there. Education researchers use the data that is convenient, without sufficient attention to the relevance of that data to the questions they are trying to answer. I have some sympathy for both the man and the researchers. I would probably first look where the light was good. After all, maybe the keys are there. But when you cannot find them after a thorough search, it is time to look elsewhere.

Some 36 states now use the Praxis II to test prospective mathematics teachers. The questions on this exam go through an elaborate vetting process (see here, 90 page PDF). Unfortunately, most of the richness of this data set is discarded. What is preserved is a pass/fail decision, the criteria for which vary enormously from state to state. That’s not good enough.

Save the data!

Wednesday, August 02, 2006

History Lesson II

In the previous post we saw that states have an incentive to skew their data on student testing. There is a similar dynamic at play in teacher testing. The NCLB requires that teachers demonstrate competence “as defined by the state, in each core academic subject he or she teaches.”

States have free rein. They can use their own tests, and most of the big states do. Even on ETS tests, they set their own passing scores. Most states require that their new teachers pass some sort of subject matter competency test, but veteran teachers can opt to bypass any direct measure of competence by jumping through a few additional hoops called HOUSSE.

Such a system creates lots of paperwork headaches for lots of educrats, but it has little chance of actually accomplishing the goal of improving teacher quality. It is a system that creates pressure for the states to simply define low standards that assure their own success, rather than make politically difficult changes to improve the quality of the teaching force. The federal government only seems to care whether the states are meeting their self-defined standards. It is a wonder that any states ever come up short. To understand in detail what the state standards really are is a laborious task. I know of only one significant attempt.

In their 1999 report Not Good Enough, the Education Trust examined the content and passing criteria for a large number of such tests. They came close to catching the states in full-fledged deception mode. But for one major oversight (to be explained shortly), they might have revealed one of the states’ clever tricks to obfuscate performance. Unfortunately, their report didn’t get the attention it deserved, and the deception has continued into the NCLB era.

On the test of secondary mathematics content knowledge (the Educational Testing Service’s Praxis II 0061 test), the Education Trust reported that two states set passing standards below 50%. Fifty percent seems to be a psychologically important threshold, so this finding was highlighted in several subsequent studies. For example, the following appears in this 2000 report Preparing and Supporting New Teachers prepared by SRI researchers for the U.S. Department of Education:
Critics argue that the teacher tests are too easy and that the passing scores are benchmarked very low in most states. For example, on the Praxis II mathematics content tests, teacher candidates in Pennsylvania and Georgia can pass with fewer than 50 percent of the items answered correctly (Education Trust, 1999).
This is on a test that Not Good Enough told us was largely at the high school level, and could be passed by a B+ high school student.

This low pass score problem got some attention, but it just wasn’t a big enough issue. After all, only two of the thirteen states set pass scores this low, and both were almost at 50%. Besides, these were standards that defined the minimal-ability beginning teacher, not the “highly qualified” teacher of today. In addition, the problem was left unquantified; that is, we didn't know how many of these barely passing teachers were actually teaching.

An additional problem with Not Good Enough was that the Education Trust’s policy recommendations were so unrealistic. In their 2000 report Generalizations in Teacher Education: Seductive and Misleading, Gitomer and Latham state:
Finally, there is increasing policy debate concerning the raising of passing standards for teacher licensure tests. Organizations like the Education Trust (1999) have proposed deceptively simple solutions, such as “raising the bar” for teachers by requiring them to meet far more stringent testing guidelines than are currently in place in order to earn a license to practice. This myopic perspective, however, fails to acknowledge the complexity of the issues embedded in teacher reform. While higher passing standards would elevate the academic profile of those who pass by reducing the pool of candidates and selectively removing a group of individuals with lower mean SAT scores, higher passing standards would also limit the supply substantially. If the highest passing scores currently used in any one state were implemented across all states, fewer than half the candidates would pass Praxis I, and fewer than two thirds would pass Praxis II. Without other interventions the supply of minority candidates would be hit the most severely. For example, only 17% of the African-American candidates would pass Praxis I, and just one third would pass Praxis II. The dramatic effects that would be brought about by raising passing standards require careful policy analysis.
So what educrat would want to raise standards if doing so would precipitate a crisis of quantity and diversity in the teacher workforce?

Unfortunately, the Education Trust’s data, while technically accurate, was misleading. In a previous post, “The Highly Qualified Math Teacher”, I showed how the pass scores used by the Education Trust grossly overstate the examinees’ knowledge because the Praxis II tests allow guessing without penalty. Under these conditions an examinee with zero content knowledge still gets, on average, 25% of the questions right. The knowledge represented by that 46% raw score shrinks considerably when you realize it is on a scale where zero knowledge earns a 25% raw score.

The Education Trust’s numbers can be adjusted to account for this condition. With this adjustment zero content knowledge maps into the expected zero percent. In the table below I reproduce the Education Trust’s table, but add a column with this adjustment.

The following table, taken from Not Good Enough, shows the 1999 passing standards, by state, for the 0061 exam. The second column gives the 1999 pass score (or cut score) for each state. The third column is the percentage of correct answers that corresponds to the pass score. The fourth column is an adjustment to the third column that corrects for the “free guessing” effect. The last row is also added.

Praxis II (0061) cut scores by state (1999)
State             Passing Score (1999)   Estimated % Correct to pass   Adjusted % Correct to pass
Oregon                    147                       65                            53
Connecticut               141                       60                            47
DC                        141                       60                            47
Kentucky                  141                       60                            47
Missouri                  137                       57                            42
Arkansas                  136                       56                            41
Hawaii                    136                       56                            41
Tennessee                 136                       56                            41
North Carolina            133                       53                            37
West Virginia             133                       53                            37
New Jersey                130                       51                            35
Pennsylvania              127                       49                            32
Georgia                   124                       46                            28
Knows Nothing             100                       25                             0
Table 1. Table from Not Good Enough. The fourth column and last row are added.
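For anyone who wants to check the added column, here is a minimal sketch of the correction. The state figures are copied from Table 1, and the rescaling is the one described above: pure guessing on a four-choice test yields 25% correct on average, and the adjustment maps that chance level to zero. Differences of a point here and there come from the table's already-rounded percent-correct column.

```python
# Sketch: reproduce the "Adjusted % Correct to pass" column of Table 1 from
# the "Estimated % Correct to pass" column. Zero content knowledge corresponds
# to 25% raw on a four-choice test, so the raw percentage is linearly rescaled
# so that 25% maps to 0% while 100% stays at 100%.

TABLE_1 = [  # (state, 1999 passing score, estimated % correct to pass)
    ("Oregon", 147, 65), ("Connecticut", 141, 60), ("DC", 141, 60),
    ("Kentucky", 141, 60), ("Missouri", 137, 57), ("Arkansas", 136, 56),
    ("Hawaii", 136, 56), ("Tennessee", 136, 56), ("North Carolina", 133, 53),
    ("West Virginia", 133, 53), ("New Jersey", 130, 51),
    ("Pennsylvania", 127, 49), ("Georgia", 124, 46), ("Knows Nothing", 100, 25),
]

def adjusted_percent(pct_correct, chance=25.0):
    """Rescale so that chance-level guessing (25%) corresponds to zero knowledge."""
    return 100.0 * (pct_correct - chance) / (100.0 - chance)

for state, cut, pct in TABLE_1:
    print(f"{state:15s} cut {cut}   raw {pct:2d}%   adjusted {adjusted_percent(pct):5.1f}%")
```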

Somehow, for all their diligence in analyzing this test and compiling this data, the Education Trust missed this important correction. They did not mention that the Praxis II allows free guessing. They did not tell their readers that 25% would represent zero content knowledge. So no one reading their report could even infer that such a correction was needed.

What if they had reported that 12 of 13 states set passing scores at a level of knowing less than 50% (several under 40%) of this high-school-level material? This issue would have received a lot more serious attention. At some point the standards are so low and so widespread that they just cry out for attention.

History Lesson

A recurring theme in my posts will be the importance of understanding data before drawing conclusions from it. It seems obvious, but putting it into practice requires a lot of work. This is especially true in education, where spin is usually far more important than careful analysis, producing what Ken DeRosa disparagingly calls Stinky Research.

The most flagrant example of this was the scandal in student testing that occurred in the 1980s. There is a clear analogy between this student testing scandal and the teacher testing issues this blog is addressing, so it is worth reviewing the history. The plot was uncovered by a medical doctor, John Jacob Cannell, who began to question the spin:

My education about the corruption of American public school achievement testing was a gradual process. It started in my medical office in a tiny town in the coal fields of Southern West Virginia, led to school rooms in the county and then the state, to the offices of testing directors and school administrators around the country, to the boardrooms of commercial test publishers, to the office of the U.S. Secretary of Education, to schools of education at major American universities, to various governors’ offices, and finally, to two American presidents.

One day in 1985, West Virginia newspapers announced all fifty-five West Virginia counties had tested above the national average. Amid the mutual congratulations, I asked myself two things. How could all the counties in West Virginia, perhaps the poorest and most illiterate state in the union, be above the national average? Moreover, if West Virginia was above average, what state was below?

In my Flat Top, West Virginia, clinic, illiterate adolescent patients with unbelievably high standardized achievement test scores told me their teachers drilled them on test questions in advance of the test. How did the teachers know what questions would be on a standardized test?

Then I learned that West Virginia schools, like most other states, used what seemed to me as a physician to be very unusual standardized tests. Unlike the standardized tests that I knew - such as college entrance, medical school admission, or medical licensure examinations - public school achievement exams used the same exact questions year after year and then compared those scores to an old, and dubious, norm group - not to a national average. Furthermore, educators - the group really being tested - had physical control of the tests and the teachers administered them without any meaningful test security.
Please read the whole thing.

States still administer their own tests. But cheating has been made more difficult because students also take the national NAEP tests. For example, we have this New York Times report:
Students Ace State Tests, but Earn D's From U.S.

A comparison of state test results against the latest National Assessment of Educational Progress, a federal test mandated by the No Child Left Behind law, shows that wide discrepancies between the state and federal findings were commonplace. ...

States set the stringency of their own tests as well as the number of questions students must answer correctly to be labeled proficient. And because states that fail to raise scores over time face serious sanctions, there is little incentive to make the exams difficult, some educators say.
One of the big political compromises in NCLB was the extent to which states retained almost complete control over the quantitative aspects of testing. They define the tests and the mapping of test scores into broad categories. Even now, with the NAEP results as oversight, states still skew their student testing for political advantage. See, for example, this related story Exploiting NCLB With Social Promotion.

In teacher testing there is virtually no oversight. States choose the pass scores that define what “highly qualified” means. Something akin to the NAEP is needed for teacher testing. It could be as simple as using the Praxis II and defining additional categories as was done in Table 2 of my The Highly Qualified Math Teacher post. Note that this doesn’t prevent any state from hiring anyone they want. It just prevents them from labeling the teacher “highly qualified” unless there’s some proof. Without some kind of objective standards, putting a “highly qualified” teacher in every classroom has little meaning.

HQT Q&A #1

In this post I respond to some questions that were raised about my post The Highly Qualified Math Teacher

1. In Table 1 how did you calculate the last column? I suppose that what I am asking is how raw scores were converted into scaled scores for the Praxis II exam.
In standardized tests, the raw-to-scaled score conversion can change with each administration of the test, though usually not by much. The only number that is comparable from test to test, for psychometrically sound tests, is the scaled score. The Education Trust must have had access to the actual raw-to-scaled score conversion tables for the tests they analyzed in order to produce the table on page 22 of Not Good Enough. In their table they give an estimate of the percent correct corresponding to each passing score; from that percentage I can, of course, calculate the corresponding raw score.

Their results match up nicely with a sample raw-to-scaled score conversion table found on page 201 of Cracking the Praxis by Fritz Stewart and Rick Sliter (you can go to books.google.com and search for mathematics Praxis II; this book shows up as one of the first few entries). One thing that is obvious from this more complete table is that raw scores between 0 and 14 all map into the lowest scaled score of 100. This is consistent with the fact that, on average, an examinee will get a raw score of 12-13 just by randomly guessing at all the questions, so these raw scores represent no content knowledge.

A scaled score of 124 corresponds to 46% correct responses in both the Not Good Enough and Cracking the Praxis tables. Since there is no penalty for guessing, we may safely assume the examinee will answer all the questions. On this 50-question test the examinee got 23 (46%) right and 27 wrong. Applying the appropriate SAT-like penalty (with 4 answer choices per problem) of -1/3 point per incorrect response, the adjusted raw score is 23 - 27/3 = 14, and so the adjusted percent correct is (14/50)*100% = 28%.

You can also think of it this way: suppose an examinee can solve 28% of the problems on this exam. If he randomly guesses at the rest, which he will do because there is no penalty, what is his expected raw score? He gets the 14 he knows how to solve plus 1/4 of the 36 he guesses at, i.e., the 23 we started with from the original raw-to-scaled score conversion.

This correction method can be reduced to the equation:

Pa = (4/3)(Pr - 25%)

where Pa (e.g. 28%) is the adjusted percentage and Pr (e.g. 46%) is the originally reported raw percentage.

One can also derive the same equation by using three-parameter Item Response Theory with the guessing parameter set at 0.25. You can think of the above as a simple linear rescaling from the non-intuitive 25%-100% range to the more intuitive 0%-100% range. If you want to use 46%, you must keep in mind that this is on a scale where 25% represents zero content knowledge.
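Here is a small sketch of the worked example (my own illustration, not part of the ETS materials), showing that the SAT-like penalty calculation and the linear rescaling give the same 28% for the 46%-correct examinee.

```python
# Sketch of the worked example: 23 of 50 correct (46%) with four answer
# choices per question. The -1/3-point penalty per wrong answer and the
# linear rescaling Pa = (4/3)(Pr - 25%) give the same adjusted score.

N_QUESTIONS = 50
N_CHOICES = 4

def penalty_adjusted_percent(n_correct, n_questions=N_QUESTIONS, n_choices=N_CHOICES):
    """Adjusted percent after a -1/(choices - 1) penalty for each wrong answer,
    assuming the examinee answers every question (guessing is free)."""
    n_wrong = n_questions - n_correct
    adjusted_raw = n_correct - n_wrong / (n_choices - 1)
    return 100.0 * adjusted_raw / n_questions

def rescaled_percent(pct_correct, n_choices=N_CHOICES):
    """Equivalent linear rescaling: chance level (100/choices %) maps to zero."""
    chance = 100.0 / n_choices
    return 100.0 * (pct_correct - chance) / (100.0 - chance)

print(penalty_adjusted_percent(23))  # 28.0
print(rescaled_percent(46.0))        # 28.0
```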

Of all these explanations, I prefer the SAT-like penalty one, because this practice is used on multiple-choice tests that should be familiar to a general audience (the SAT has five answer choices per question and penalizes -1/4 point for an incorrect response). As I argued in my previous post, this adjusted score IS the score the examinee would get on the test with penalties, because the optimal strategy would be to continue to guess on any question where he can eliminate at least one answer choice, and the penalty is neutral in those cases where he cannot eliminate any answer choices.

Also, every caveat I can come up with tends to drive the scores even lower. What we would really like to know is how the examinee would perform on a constructed-response exam, i.e., an exam where no answer choices are given. To estimate this you have to go beyond three-parameter Item Response Theory and make other assumptions. If I use the Garcia-Perez model in our previous case, the estimated percent correct drops from 46% on a multiple-choice test without penalty, to 28% on the multiple-choice test with penalty, to 21% on a constructed-response exam. The Strauss estimate, based on the scaled score alone, would be 24%. Furthermore, since this test has a margin of error, and the examinee can take it multiple times, and we are examining the bottom end where multiple attempts are likely — we should adjust these scores downward by some fraction of the scoring error margin.


2. How did you calculate (or estimate) the numbers in Table 2?
I got them directly from ETS. ETS kindly agreed to meet with me, and we spent several hours going over data. Prior to meeting with ETS, I estimated similar numbers by fitting a normal distribution to the summary statistics that accompanied my score report. (Yes, I actually took this test. Even though some of my high school math was rusty from 30 years of non-use, I still scored above a 190 and am convinced I would have scored a 200 as a high school senior.) Those statistics told me that the first and third quartile scores were 128 and 157 respectively, from which I got a normal distribution with a mean of 142.5 and a standard deviation of 21.5. As an additional check, ETS defines a Recognition of Excellence (ROE) level for a subset of its Praxis II exams, which for the 0061 test was a 165, at the 85th percentile. This matched my normal distribution almost perfectly.
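The quartile fit can be reproduced in a few lines. Here is a sketch (using SciPy, my choice of tool for the illustration, not necessarily what was used at the time) that recovers the mean, the standard deviation, and the percentile of the ROE score.

```python
# Sketch: fit a normal distribution to the reported quartiles (128 and 157)
# and check it against the ROE score of 165 at the 85th percentile.
from scipy.stats import norm

q1, q3 = 128.0, 157.0
mean = (q1 + q3) / 2.0                       # 142.5
sigma = (q3 - q1) / (2.0 * norm.ppf(0.75))   # IQR / 1.349, about 21.5

print(f"mean = {mean:.1f}, standard deviation = {sigma:.1f}")
print(f"ROE score of 165 sits at about the {100 * norm.cdf(165, mean, sigma):.0f}th percentile")
# roughly the 85th percentile, matching the ETS figure
```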

3. A related question is how did you come up with the numbers in the paragraph just below Table 2, particularly the 15% for the percentage who “demonstrate competency”?
From the normal distribution. Of course, the scaled to raw score table I am using is quantized, so what I really find is that 15.8% of the examinee population scores at or above a 164, which corresponds to an adjusted percent correct of 70.7%. It is a somewhat arbitrary assumption on my part that an adjusted score of 70% defines competency.
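As a quick check on that figure, the continuous normal fit from question 2 gives essentially the same answer as the quantized ETS table; a brief sketch (again using SciPy):

```python
# Sketch: fraction of the fitted examinee distribution (mean 142.5, sd 21.5)
# scoring at or above a scaled 164. The actual ETS score table gives 15.8%;
# the continuous normal fit gives roughly the same number.
from scipy.stats import norm

share = 1.0 - norm.cdf(164, loc=142.5, scale=21.5)
print(f"{100 * share:.1f}% of examinees score at or above 164")  # about 15.9%
```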

What we would really like to know is what these percentages are in the existing teacher pool. All we have now are estimates based on the examinee pool. But in math, some 25% never pass, so these are eliminated. We have reason to believe that at the upper end of the distribution a disproportionately large number either never enter teaching or leave teaching early. So my best guess is that the bottom end isn't as bad, nor the top end as good, as these examinee numbers suggest.