Teacher Evaluation

  • Teacher Evaluations: Don't Begin Assembly Until You Have All The Parts

    Written on July 19, 2011

    ** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

    Over the past year or two, roughly 15-20 states have passed or are considering legislation calling for the overhaul of teacher evaluation. The central feature of most of these laws is a mandate to incorporate measures of student test score growth, in most cases specifying a minimum percentage of a teacher’s total score that must consist of these estimates.

    There’s some variation across states, but the percentages are all quite high. For example, Florida and Colorado both require that at least 50 percent of an evaluation must be based on growth measures, while New York mandates a minimum of 40 percent. These laws also vary in terms of other specifics, such as the degree to which the growth measure proportion must be based on state tests (rather than other assessments), how much flexibility districts have in designing their systems, and how teachers in untested grades and subjects are evaluated. But they all share that defining feature of mandating a minimum proportion – or “weight” – that must be attached to a test-based estimate of teacher effects (at least for those teachers in tested grades and subjects).

    Unfortunately, this is typical of the misguided manner in which many lawmakers (and the advocates advising them) have approached the difficult task of overhauling teacher evaluation systems. For instance, I have discussed previously the failure of most systems to account for random error. The weighting issue is another important example, and it violates a basic rule of designing performance assessment systems: You should exercise extreme caution in pre-deciding the importance of any one component until you know what the other components will be. Put simply, you should have all the parts in front of you before you begin the assembly process.
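
    The weighting problem has a simple arithmetic face: in a composite score, a component’s nominal weight tells you little about its actual influence until you know how much each component varies. Here is a minimal sketch in Python (the component names, scores, and weights are invented for illustration, not drawn from any actual state system):

```python
import statistics

# Hypothetical composite for five teachers, each scored 0-100 on two
# components. All numbers are invented for illustration only.
growth = [35, 48, 50, 52, 65]        # test-based growth: widely spread
observation = [88, 90, 91, 92, 89]   # principal observations: tightly clustered

weights = {"growth": 0.5, "observation": 0.5}

composites = [
    weights["growth"] * g + weights["observation"] * o
    for g, o in zip(growth, observation)
]

print("composite scores:", composites)
print("growth stdev:", round(statistics.stdev(growth), 1))
print("observation stdev:", round(statistics.stdev(observation), 1))
```

    Even with nominal weights of 50/50, the growth component drives virtually all of the differences between teachers here, because it varies so much more. Fixing a minimum weight in statute before knowing the variances and correlations of the other components fixes the label, not the actual influence.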

    READ MORE
  • The Faulty Logic Of Using Student Surveys In Accountability Systems

    Written on July 11, 2011

    In a recent post, I discussed the questionable value of student survey data to inform teacher evaluation models. Not only is there little research support for such surveys, but the very framing of the idea often reflects faulty reasoning.

    A quote from a recent Educators 4 Excellence white paper helps to illustrate the point:

    For a system that aims to serve students, young people’s interests are far too often pushed aside. Students’ voices should be at the forefront of the education debate today, especially when it comes to determining the effectiveness of their teacher.

    This sounds noble… but seriously, why should students’ opinions be “at the forefront of the education debate”? Are students’ needs better served when we ask students what they need directly? Research on this is explicit: no, not really.

    READ MORE
  • Student Surveys of Teachers: Be Careful What You Ask For

    Written on June 23, 2011

    Many believe that current teacher evaluation systems are a formality, a bureaucratic process that tells us little about how to improve classroom instruction. In New York, for example, 40 percent of all teacher evaluations must consist of student achievement data by 2013. Additionally, some are proposing the inclusion of alternative measures, such as “independent outside observations” or “student surveys” among others. Here, I focus on the latter.

    Educators for Excellence (E4E), an “organization of education professionals who seek to provide an independent voice for educators in the debate surrounding education reform”, recently released a teacher evaluation white paper proposing that student surveys account for 10 percent of teacher evaluations.

    The paper quotes a teacher saying: “for a system that aims to serve students, young people’s interests are far too often pushed aside. Students’ voices should be at the forefront of the education debate today, especially when it comes to determining the effectiveness of their teacher.” The authors argue that “the presence of effective teachers […] can be determined, in part, by the perceptions of the students that interact with them.” Also, “student surveys offer teachers immediate and qualitative feedback, recognize the importance of student voice […].” In rare cases, the paper concedes, “students could skew their responses to retaliate against teachers or give high marks to teachers who they like, regardless of whether those teachers are helping them learn.”

    But student evaluations are not new.

    READ MORE
  • The Ethics of Testing Children Solely To Evaluate Adults

    Written on June 1, 2011

    The recent New York Times article, “Tests for Pupils, but the Grades Go to Teachers,” alerts us to an emerging paradox in education – the development and use of standardized student testing solely as a means to evaluate teachers, not students. “We are not focusing on teaching and learning anymore; we are focusing on collecting data,” says one mother quoted in the article. Now, let’s see: collecting data on minors that is not explicitly for their benefit – does this ring a bell?

    In the world of social/behavioral science research, such an enterprise – collecting data on people, especially on minors – would inevitably require approval from the Institutional Review Board (IRB). For those not familiar, IRB is a committee that oversees research that involves people and is responsible for ensuring that studies are designed in an ethical manner. Even in conducting a seemingly harmless interview on political attitudes or observing a group studying in a public library, the researcher would almost certainly be required to go through a series of steps to safeguard participants and ensure that the norms governing ethical research will be observed.

    Very succinctly, IRBs’ mission is to see that (1) the risk-benefit ratio of conducting the research is favorable; (2) any suffering or distress that participants may experience during or after the study is understood, minimized, and addressed; and (3) research participants agreed to participate freely and knowingly – usually, subjects are requested to sign an informed consent form, which includes a description of the study’s risks and benefits, a discussion of how confidentiality will be guaranteed, a statement on the voluntary nature of involvement, and a clarification that refusal or withdrawal at any time will involve no penalty or loss of benefits. When the research involves minors, parental consent and sometimes child assent are needed.

    In short, IRB procedures exist to protect people. To my knowledge, student evaluation procedures and standardized testing are exempt from this sort of scrutiny. So the real question is: Should they be? Perhaps not.

    READ MORE
  • Value-Added In Teacher Evaluations: Built To Fail

    Written on May 31, 2011

    With all the controversy and acrimonious debate surrounding the use of value-added models in teacher evaluation, few seem to be paying much attention to the implementation details in those states and districts that are already moving ahead. This is unfortunate, because most new evaluation systems that use value-added estimates are literally being designed to fail.

    Much of the criticism of value-added (VA) focuses on systematic bias, such as that stemming from non-random classroom assignment. But the truth is that most of the imprecision of value-added estimates stems from random error. Months ago, I lamented the fact that most states and districts incorporating value-added estimates into their teacher evaluations were not making any effort to account for this error. Everyone knows that there is a great deal of imprecision in value-added ratings, but few policymakers seem to realize that there are relatively easy ways to mitigate the problem.

    This is the height of foolishness. Policy is details. The manner in which one uses value-added estimates is just as important – perhaps even more so – than the properties of the models themselves. By ignoring error when incorporating these estimates into evaluation systems, policymakers virtually guarantee that most teachers will receive incorrect ratings. Let me explain.
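
    The mechanics can be seen in a toy simulation (all numbers invented; this is a sketch of the statistical issue, not of any actual state’s model): when single-year estimates carry noise comparable to the spread of true teacher effects, a cutoff applied directly to the point estimates misclassifies a large share of teachers, while even a crude standard-error buffer cuts the error rate substantially.

```python
import random

random.seed(0)

# Toy simulation: each teacher has a stable "true" effect, and the yearly
# value-added estimate adds noise comparable in size to the spread of
# true effects. All numbers here are invented for illustration.
n = 1000
true_effects = [random.gauss(0, 1) for _ in range(n)]
estimates = [t + random.gauss(0, 1) for t in true_effects]

# Naive rule: label any teacher with a below-average estimate "ineffective".
naive_wrong = sum((e < 0) != (t < 0) for e, t in zip(estimates, true_effects))
print(f"misclassified when error is ignored: {naive_wrong / n:.0%}")

# One simple mitigation: withhold judgment unless the estimate is at
# least one standard error away from the cutoff.
confident = [(e, t) for e, t in zip(estimates, true_effects) if abs(e) > 1]
conf_wrong = sum((e < 0) != (t < 0) for e, t in confident)
print(f"classified with a 1-SE buffer: {len(confident) / n:.0%} of teachers")
print(f"misclassified among those: {conf_wrong / len(confident):.0%}")
```

    The point is not that a one-standard-error buffer is the right rule, only that accounting for error in some explicit way changes outcomes for a large number of teachers.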

    READ MORE
  • The New Layoff Formula Project

    Written on April 27, 2011

    In a previous post about seniority-based layoffs, I argued that, although seniority may not be the optimal primary factor upon which to base layoff decisions, we do not yet have an acceptable alternative in most places—one that would permit the “quality-based” layoffs that we often hear mentioned. In short, I am completely receptive to other layoff criteria, but insofar as new teacher evaluation systems are still in the design phase in most places, states and districts might want to think twice before chucking a longstanding criterion that has (at least some) evidence of validity before they have a workable replacement.

    The New Teacher Project (TNTP) recently released a short policy brief outlining a proposed alternative. Let’s take a quick look at what they have to offer.

    READ MORE
  • In Performance Evaluations, Subjectivity Is Not Random

    Written on March 18, 2011

    Employment policies associated with unions – e.g., seniority, salary schedules – are frequently criticized for not placing the highest premium on performance. Detractors also argue that such policies, originally designed to protect workers against discrimination (by gender, race, etc.), are no longer necessary now that federal laws are in place. Accordingly, those seeking to limit collective bargaining among teachers have proposed that current policies be replaced by “performance-based” evaluations – or at least a system that would make it easier to reward and punish based on performance.

    Be careful, argues Samuel A. Culbert in a recent New York Times article, “Why Your Boss is Wrong About You.” Culbert warns that there are serious risks to deregulating the employment relationship, and leaving it even partially in the hands of the employer and his/her performance review:

    Now, maybe your boss is all-knowing. But I’ve never seen one that was. In a self-interested world, where imperfect people are judging other imperfect people, anybody reviewing somebody else’s performance ... is subjective.

    This viewpoint may sound obvious, but social science research reminds us that the whims of subjective human judgment are not random. The inefficiencies that Culbert mentions are inevitable, but so is the fact that bias tends to operate in a manner that disproportionately affects workers from traditionally disadvantaged social groups, such as women and African Americans. What’s worse, it is just as likely to occur within groups as between them, and we often do it without realizing it.

    READ MORE
  • Value-Added: Theory Versus Practice

    Written on February 18, 2011

    ** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

    About two weeks ago, the National Education Policy Center (NEPC) released a review of last year’s Los Angeles Times (LAT) value-added analysis – with a specific focus on the technical report upon which the paper’s articles were based (done by RAND’s Richard Buddin). In line with prior research, the critique’s authors – Derek Briggs and Ben Domingue – redid the LAT analysis, and found that teachers’ scores vary widely, but that the LAT estimates would be different under different model specifications; are error-prone; and conceal systematic bias from non-random classroom assignments. They were also, for reasons yet unknown, unable to replicate the results.

    Since then, the Times has issued two responses. The first was a quickly-published article, which claimed (including in the headline) that the LAT results were confirmed by Briggs/Domingue – even though the review reached the opposite conclusions. The basis for this claim, according to the piece, was that both analyses showed wide variation in teachers’ effects on test scores (see NEPC’s reply to this article). Then, a couple of days ago, there was another response, this time on the Times’ ombudsman-style blog. This piece quotes the paper’s Assistant Managing Editor, David Lauter, who stands by the paper’s findings and the earlier article, arguing that the biggest question is:

    ...whether teachers have a significant impact on what their students learn or whether student achievement is all about ... factors outside of teachers’ control. ... The Colorado study comes down on our side of that debate. ... For parents and others concerned about this issue, that’s the most significant finding: the quality of teachers matters.

    Saying “teachers matter” is roughly equivalent to saying that teacher effects vary widely – the more teachers vary in their effectiveness, controlling for other relevant factors, the more they can be said to “matter” as a factor explaining student outcomes. Since both analyses found such variation, the Times claims that the NEPC review confirms their “most significant finding.”
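
    The gap between “teacher effects vary widely” and “these particular estimates are reliable” can be illustrated with a toy simulation (all numbers invented): two noisy model specifications can show the same wide spread in estimated effects while placing many individual teachers in different quartiles.

```python
import random
import statistics

random.seed(1)

# Toy illustration (all numbers invented): the same true teacher effects,
# estimated by two model specifications that each add independent noise.
n = 500
true_effects = [random.gauss(0, 1) for _ in range(n)]
model_a = [t + random.gauss(0, 1) for t in true_effects]
model_b = [t + random.gauss(0, 1) for t in true_effects]

# Both models agree that effects "vary widely"...
print("spread under model A:", round(statistics.stdev(model_a), 2))
print("spread under model B:", round(statistics.stdev(model_b), 2))

# ...yet they often disagree about where an individual teacher stands.
def quartile(scores, x):
    """Return 0-3: which quarter of the distribution x falls into."""
    return sum(s <= x for s in scores) * 4 // (len(scores) + 1)

moved = sum(
    quartile(model_a, a) != quartile(model_b, b)
    for a, b in zip(model_a, model_b)
)
print(f"teachers placed in different quartiles: {moved / n:.0%}")
```

    In other words, finding wide variation in both analyses establishes that teacher quality matters in the aggregate; it says nothing about whether either set of individual estimates is trustworthy.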

    The review’s authors had a much different interpretation (see their second reply). This may seem frustrating. All the back and forth has mostly focused on somewhat technical issues, such as model selection, sample comparability, and research protocol (with some ethical charges thrown in for good measure). These are essential matters, but there is also an even simpler reason for the divergent interpretations, one that is critically important and arises constantly in our debates about value-added.

    READ MORE
  • Premises, Presentation And Predetermination In The Gates MET Study

    Written on January 12, 2011

    ** Also posted here on “Valerie Strauss’ Answer Sheet” in the Washington Post

    The National Education Policy Center today released a scathing review of last month’s preliminary report from the Gates Foundation-funded Measures of Effective Teaching (MET) project. The critique was written by Jesse Rothstein, a highly-respected Berkeley economist and author of an elegant and oft-cited paper demonstrating how non-random classroom assignment biases value-added estimates (also see the follow-up analysis).

    Very quickly on the project: Over two school years (this year and last), MET researchers, working in six large districts—Charlotte-Mecklenburg, Dallas, Denver, Hillsborough County (FL), Memphis, and New York City—have been gathering an unprecedented collection of data on teachers and students, grades 4-8. Using a variety of assessments, videotapes of classroom instruction, and surveys (student surveys are featured in the preliminary report), the project is attempting to address some of the heretofore under-addressed issues in the measurement of teacher quality (especially non-random classroom assignment and how different classroom practices lead to different outcomes, neither of which is part of this preliminary report). The end goal is to use this information to guide the creation of more effective teacher evaluation systems that incorporate high-quality multiple measures.

    Despite my disagreements with some of the Gates Foundation’s core views about school reform, I think that they deserve a lot of credit for this project. It is heavily-resourced, the research team is top-notch, and the issues they’re looking at are huge.  The study is very, very important — done correctly. 

    But Rothstein’s general conclusion about the initial MET report is that the results “do not support the conclusions drawn from them.” Very early in the review, the following assertion also jumps off the page: “there are troubling indications that the Project’s conclusions were predetermined.”

    READ MORE
  • The War On Error

    Written on December 7, 2010

    The debate on the use of value-added models (VAM) in teacher evaluations has reached an impasse of sorts. Opponents of VAM use contend that the measures are too imprecise to be used in evaluation; supporters argue that current systems are inadequate, and that, while all measures entail error, this doesn’t preclude using the estimates.

    This back-and-forth may be missing the mark, and it is not particularly useful in the states and districts that are already moving ahead. The more salient issue, in my view, is less about the amount of error than about how it is dealt with when the estimates are used (along with other measures) in evaluation systems.

    Teachers certainly understand that some level of imprecision is inherent in any evaluation method—indeed, many will tell you about colleagues who shouldn’t be in the classroom, but receive good evaluation ratings from principals year after year. Proponents of VAM often point to this tendency of current evaluation systems to give “false positive” ratings as a reason to push forward quickly. But moving so carelessly that we disregard the error in current VAM estimates—and possible methods to reduce its negative impacts—is no different than ignoring false positives in existing systems.

    READ MORE


DISCLAIMER

This web site and the information contained herein are provided as a service to those who are interested in the work of the Albert Shanker Institute (ASI). ASI makes no warranties, either express or implied, concerning the information contained on or linked from shankerblog.org. The visitor uses the information provided herein at his/her own risk. ASI, its officers, board members, agents, and employees specifically disclaim any and all liability from damages which may result from the utilization of the information provided herein. The content in the Shanker Blog may not necessarily reflect the views or official policy positions of ASI or any related entity or organization.