LUNCHEON RECESS
Licensing - Gary Philips
Afternoon Session Questions
Federalism and Inter-Governmental Relations
Presentations:
Bruce McDowell
Beryl Radin
John Shannon
Jack Knott
Summations
P R O C E E D I N G S
8:35 a.m.
MR. SHAVELSON: Good morning. On behalf of the Board on Testing and Assessment and the National Academy of Sciences, it's my pleasure to welcome you to our workshop on National Tests, Regulatory and Licensing Issues.
I'm Rich Shavelson, I'm Chair of BOTA. BOTA is a standing Board of the National Research Council, which is the research arm of the National Academy of Sciences. BOTA's charge is to advise the Federal Government and nation on issues of testing and assessment, policy and practice.
Its purview is broad, encompassing the multitude of test uses in our society, including education, health, labor, and the military. Consequently, this workshop on President Clinton's proposal for national tests of Reading and Mathematics, fits squarely into our agenda.
Let me say a few words about BOTA and its members before we jump into the agenda today. We're fortunate to have with us today, seven members from the Board on Testing and Assessment. We have on my left here, Bob Linn, Vice Chair of BOTA and distinguished professor of education at the University of Colorado, and Chris Edley who is professor of law at Harvard University.
In addition, we have Art Goldberger who's a member of the National Academy of Sciences and professor of economics at Wisconsin; Carl Kaestle, who's the immediate, past President of the National Academy of Education and professor of History and Education at Brown University.
I didn't see Richard Duran. Is Richard here? He's on the way. Richard is coming in from UC Santa Barbara, but if the weather's nice he may not make it. And we have Bill Taylor, attorney, and currently a visiting professor of Law and Education at Stanford University.
We thank them for their dedication to BOTA and the NRC, and for their willingness to share some of their time on behalf of the issues we're about to address today. It's worth noting again that all of our members act as volunteers, which is one way the Academy aspires towards objective and independent, scientific judgment.
A couple of words about the projects that are going on within the Board. We're currently involved in several activities that may be of interest to the group. We're involved in a 3-year evaluation of the National Assessment of Educational Progress; we're involved in a new study of Title I assessment which is about to begin this summer; a new roundtable on work, learning, and assessment, designed to bring together people from business, education, and government for discussions about schooling and the future of work, learning, employment testing, and post-secondary admissions.
We have a new project that will review and synthesize research from cognitive and neuropsychology, and its implications for assessment. And finally, last but not least, we have a new book coming out, Educating One and All, Standards-Based Reform and Students with Disabilities. It will be published next week.
In addition, the Board holds periodic workshops and colloquia on selected topics. Some of our reports are on topics as varied as IQ tests and special education, Goals 2000 and standards-based reform, the general aptitude test battery, and we're currently holding conferences on TIMSS and on science assessment.
Next week we will launch a new bi-monthly luncheon colloquium series with a visit of Nancy Cole, President of ETS, who will discuss the major study of gender bias in educational testing.
BOTA projects are sponsored mostly by the Federal Government; the Departments of Defense, Education, Labor, School and Work Opportunities, and the National Science Foundation, are or have been sponsors of our work. In addition, we would also like to acknowledge the support of several foundations: The Pew Charitable Trusts, the Spencer Foundation, and the W.T. Grant Foundation.
This conference was conceived and organized solely by BOTA and the BOTA staff and is paid for by core funding we receive under a cooperative agreement with the Department of Education. We are very pleased to be able to organize events such as this one, designed to provide a forum for sharing of the information and the integration of social knowledge, science knowledge, and policy and planning.
Finally, I'd like to thank the many individuals who are not currently members of the Board, who have agreed to take time to share with us their expertise. I won't name all of them -- you can find their names in your program -- but I do want to extend a heartfelt thanks for their time and efforts.
Now, it's my pleasure to introduce my colleague, and the Vice Chair of BOTA, Bob Linn.
MR. LINN: When President Clinton announced in the State of the Union the idea of a national test, it was a remarkable event. And when you think of the period of time since February 4th when that happened to today, just barely over four months later, a great deal has happened. It shows how fast the government bureaucracy can move when the President decides it's going to.
The national test is potentially important symbolically and practically, in terms of what impact it may have on education. It's obviously being seen as a major tool of educational reform, and it is within the spirit of the Standards movement that the Administration has backed for some time.
The Board, when we met in March, discussed the national test at some length, and we thought that there was a great deal that we should take advantage of in learning from efforts that have been made by States and districts of our country in the past, who have turned to tests as a major tool of educational reform.
So when we have a new program at the national level before us, we thought it's important that we could look at what could be learned from those past experiences, both in terms of what works well and what some of the risks or downsides there may be, when a test is put in place that may have impacts that are not always as positive as those intended by the policymakers when they put them in place.
It's also clear to the Board that how well a system is going to work depends on the complex of technical issues, policy issues, and how governance takes place in the way the test is administered and the scores are used.
The goal then, was to think of ways that we could maximize the intended benefits of such a testing program while minimizing the risks, eliminating them to the extent possible.
Now, there are many issues with regard to the national test that might have been addressed in a 2-day conference like this. There are important issues about the content of the test. If you're a Math educator or someone concerned with reading, you obviously know that it makes a big difference what goes into the nature of that instrument that's out there. Is it going to represent the kind of Mathematics that people have in mind with the standards, for example?
There are also issues of how this is going to fit in with the States' own assessment programs, etc. Now, our intent is not to focus on any of those issues, however. The focus of this conference really is on the licensing regulations. What sort of mechanisms can be put in place to support quality control? How is it that you're going to license and have regulations that maximize the benefits and minimize or eliminate the risks, the unintended, negative impact that the tests might have?
So the principal role of the meeting is in keeping with part of BOTA's charge. BOTA is in part, set up to provide a scientific forum and advice to the government on policy issues -- in particular, policy issues related to testing. Our goal here then, is to end up with some constructive discussion and suggestions that can improve the test.
We could come here to debate whether or not it's a good policy to have this testing program; whether it was the right idea in the first place. That's not the intent of this forum. The intent is not really to have a discussion about that, but rather, given that we have this policy in place at this point, how can we make the best use of it, how can we ensure that we avoid the possible downsides that I know many of you are concerned about.
So in keeping with that, the meeting is set up as you see, we will start with some overview and background on the purposes of the assessment from Gary Philips; then we'll move into a session that talks about some of what we know about the potential risks and unintended consequences of testing from other experience with testing.
The second session will deal with what we, as a profession, what the profession of measurement and professionals concerned with testing, have set up by way of mechanisms to ensure quality and to avoid misuse, such as: the Standards for Educational and Psychological Testing, the Code of Fair Testing Practices, and mechanisms that might be set up in addition to those.
The third session, the afternoon, will deal with what we can learn from other government agencies; other experiences outside the realm of testing that might be relevant for learning about how we might do a better job in this arena. In many ways, it's a new idea to have something like licensed agencies for the administration, scoring, and reporting of test scores.
At the end of the day, we'll try to come up with a few key questions so that we'll have something to do tomorrow, and tomorrow really, is to pull that together with panels to see if we can come to some agreement on sensible advice to give to the government with regard to this important initiative.
So with that, I'm ready to turn it over to the Workshop Chair -- who Rich has already introduced -- Chris Edley, who will be chairing the meeting. Thank you.
CHAIRMAN EDLEY: Thanks, Bob. According to my watch, in order to get on schedule I have to do my throat clearing in negative-2 minutes. So I think the best way to do that is for me to simply introduce Michael Feuer who is the Director of BOTA who, along with Pattie Morrison, has really done the great work, both logistical and intellectual -- in assembling this gathering, and ask Michael to introduce our first speaker. Michael?
MR. FEUER: Good morning.
CHAIRMAN EDLEY: Be brief.
MR. FEUER: Thank you. Fortunately, we have a requirement for the BOTA staff that they take some training in improvisational theater. I didn't anticipate having anything to say here this morning -- just to listen. But it's a pleasure to welcome you all, and I think Bob and Rich have given you a very good and eloquent description of what our plans are for these two days.
There is actually a theme and a structure to the way these sessions have been organized and I hope that becomes implicitly clear as we go along. But if it doesn't, I just may occasionally remind us what that structure is.
The first thing is to give everybody here the opportunity to actually hear about the origins and status and plans for the Voluntary National Reading and Mathematics Tests, and so it's a pleasure to introduce Gary Philips who is in many ways, if not the architect of the program, he is certainly the chief engineer, and is I think, ready to give us a presentation. So this is Gary Philips from the Department of Education.
MR. PHILIPS: You have a copy of the overheads in your packets, so rather than messing with it I thought maybe we'd just -- you can use what you have.
Well, thank you. I'm very pleased to have this meeting organized by the National Academy of Sciences. This is one of those meetings where the timing is perfect in that we are in the process of constructing the RFP for the licensing of the Voluntary National Tests. We really do need this input and every word you say today I can assure you, we'll be listening to and we will consider.
Right after this meeting, today and tomorrow, we will be reviewing a draft of the RFP internally -- probably late next week -- in the Department, and then shortly after that it will be on the Web for several weeks for public comment. So we're zeroing in on the licensing procedures for the test.
In this new job I've been to lots of meetings with the Secretary and the White House, and every chance they get they talk TIMSS and NAEP. For example, two days ago there was a release of the TIMSS 4th Grade report in the Rose Garden of the White House. And again, the President mentioned the Voluntary National Test and how this is such an important aspect of the whole TIMSS effort.
The whole idea here centers around the fact that the information that TIMSS and NAEP gives is really good information; it's very useful. But not a single student, not a single parent, not a single teacher, has that information about their students. The whole idea here is to take that same kind of information down to the classroom level and to provide it to parents.
The goal is to empower parents and teachers with that information, just like policymakers now have it. So it resonates well and it -- I think it's usually well received. It's hard to argue against giving parents and teachers good information that they don't currently have, and that really is the whole idea of the project.
What I would like to do is to give you a brief overview of what the plans are, and this will lead up to the licensing issues which we'll be getting into as you proceed in the meeting -- give you some background -- you know, you can ask questions as we go along. If you give me like 15 minutes or so, I'll cover some of the basic stuff and then I think you'll have some good information.
So let's do that. If you go to the overhead that says Overview of Plans, that's what I'll be talking about. First of all, the Voluntary National Testing Program is voluntary. It's voluntary in the sense that the Federal Government will not be requiring anybody to take this test.
What the Federal Government is doing is it will be funding the development of the test, standing behind its technical integrity, making sure that it's administered properly, and making it available to districts and States for their use. It really is a testing program. It's a test that will be used by districts and States; this is not a test that the Federal Government is using to collect data.
The tests are intended to provide an overall indication of proficiency in Mathematics at the 8th Grade, and Reading in English at the 4th Grade. When we say overall that means what we're trying to find out is, how well do students read? What is their Math proficiency?
This is not intended to be a diagnostic test, like the kind that districts and States already have, to get intimate information or detailed information about the content or the learning of students. This is to give an overall indication of how well they're doing compared to a national standard, and to international standards as well.
The Reading and the Math will be linked to the NAEP assessment, and to the TIMSS assessment in the case of Mathematics at Grade 8. And there will be a separate RFP and a contract to make sure that linking is done properly.
The items will be released to the public every year. So after the test is administered the items, along with scoring guides and other ancillary materials, will be made available to the public through our Web site and through the Press and that sort of thing.
The first administration is scheduled for 1999, and we're thinking about March as the month of that administration, and we're working now to think through what days and how many days, and things like that.
Okay, let's get into some more detail, which is the second page of your handout. No individually-identifiable data will be sent back to the Federal Government on this test. Not a single test score from a single student will be sent back to Washington. This is not a Federal testing program; it's not a testing program that we're using to collect data.
When we want to get information about how districts and States and the nation are doing, we will rely on NAEP; that will continue to be the primary mechanism for understanding and monitoring and reporting on the progress in the States and the districts and the nation.
So no information on the test from individual students comes back to the Federal Government. The only information the Federal Government will get on this test will be the same way that anybody else gets information. If the district or a State produces a report, we'll get a copy of it. That's the information that we'll get back.
There will be no -- not a single dollar of Federal money will be linked to taking the test in the sense -- and what I mean by that is, Federal funding will not be contingent on taking this test. It could be used, for example, in a Federal program to assess students and report on students, but it is not required; no money is contingent on the test.
The test will be consistent with the joint technical standards that are being revised for the APA, AERA, and NCME. Those will be available I think, about the time of the administration of the test. We will make sure that what we do -- is that right, Eva? Okay. What we will do is to make sure that this test is consistent with those standards.
There will be inclusion criteria and appropriate accommodations will be available. We have already committed ourselves to having a bilingual version of the Math test at Grade 8. We won't have that at Grade 4 in Reading because it's reading in English.
Those inclusion criteria, I want you to know that we are absolutely committed to making this a testing program that all students can take. And the inclusion criteria, we'll start with the NAEP inclusion criteria and we'll work from there. And those inclusion criteria and the accommodations will be developed as part of the development process.
When the contract is awarded in September, we in earnest, will get started on that. We will have many meetings on the topics. Lots of people that have an interest in this will have an opportunity to influence the outcome. But the bottom line is, we're committing to make a test that all students can take. So we want to err on the side of inclusion and not on the side of exclusion.
We want to have the tests reported in a metric that parents and teachers can understand. Again, part of the contract is to have focus groups with students, parents, and teachers to work on reporting strategies, so that when we report the results we want to make sure that it's something that they easily grasp, that they understand, that they resonate to, and that they appreciate and can talk about. So a lot of work will go into making the reporting understandable to parents and teachers.
We will begin with the NAEP framework; that's a given. We want to use the NAEP framework because that was developed through a national consensus process. There's not a unanimous agreement that it's a great framework, but there's a vast majority of people that agree that it's a good framework.
And one of the reasons why we're able to get the testing program off the ground quickly is that we don't have to do all that work that the National Assessment Governing Board has already done to develop a framework. We also want to use the achievement levels that the National Assessment Governing Board has developed.
Again, those were developed through a national consensus process. Again, they're not universally accepted but a vast majority of people agree that those are good achievement levels and they communicate what we want to communicate. So we'll be using those two givens in the project and we'll go from there.
As I said, the tests will be linked to NAEP so when we report on achievement levels in this test it will be by way of that linking what we did to the national assessment, and the same thing will be true in TIMSS. And as I said, there will be a separate RFP that will conduct that linking process.
The tests will be up to 90 minutes of testing time; that's about twice what NAEP gives to an individual student. We think that's generally about the right amount of time to get a good, reliable, valid score on individual students. It's generally consistent with what other testing programs do as well. And of course, there's a lot of variability but -- when I say generally, I mean it's sort of in the middle, generally consistent with what other testing programs do.
About half the testing time will be spent on non-multiple choice items, and about half on multiple choice items. Eighty percent of the test will be machine-scorable and the other 20 percent will be constructed response which will have to be scored by raters. We want to have the test as machine-scorable as possible because obviously, we have a large testing program and we want the results turned around quickly. And so that's a constraint.
On the other hand, we want this test to be such that the Math community and the Reading community can stand behind it and agree that this is good Reading, this is good Mathematics. And then the item and test specification work that we're doing right now through the Chiefs and MPR, those meetings are going on now and I'll talk about them in just a moment.
That is the issue that they're wrestling with and at the end, we will want them to sign off on this and to say this is good Mathematics, this is good Reading.
There will be a special booklet available at the time of administration. The special booklet will be a complete booklet of extended and constructed response items that will be developed as part of the field testing, and there will be national data on the items, for example, just like there will be on the regular test.
There will be a new booklet each year. This booklet can be used by teachers for a variety of purposes: instructional uses, classroom testing, whatever it might be. And there will be other materials given to teachers as well, and parents, as part of the testing activity.
But there will be a separate, entire booklet on extended constructed response items -- along with scoring guides, of course.
There will also be a sample test available prior to the 1999 administration. So that will be field-tested in 1998 and then prior to the actual administration we will make available to the public, a sample test along with scoring guides; again, to take the mystery out of the test so that teachers and parents and the general public will know the type of material that will be on this test.
The tests will be released, as I said, every year. The actual test itself will be released every year along with scoring guides and other materials. It will be kept secure up until the completion of the administration. There will likely be like a 1- or 2-day administration period and then maybe a day makeup, or something like that. Right after that it will be released to the public.
We want to have the results reported within the same year, so if we are administering the test in March that means the results will be out probably in May. And again, this is one of the constraints on the program, is that we want to make sure that the test is such that it can be scored quickly enough so that the results can be released during the same school year.
There will be an ongoing research component, so each year we'll be looking at research questions that need to be dealt with. The first year we'll likely look at the validity of the tests for special populations and for certain uses, and then each year new questions will come up and new research will be conducted, and that will be done on an ongoing basis. Funds will be set aside to make sure that research is done.
There will be an ongoing evaluation component so there will be an independent, prestigious group that will report on the activities of the testing program and the success of the program with an annual report to the President and to the Congress. And we're working to try to get that group in place as soon as possible because we'd like to have them here watching what we're doing now, so they can report to the President and the Congress on the success of what we're doing and make recommendations for improvements.
There will be an ongoing advisory structure. The advisory structure, or the panels that are in place now with the item and test specifications is sort of a mini-version of the more permanent structure that we'll have. And I'll get into that in just a moment when I talk about the item and test specifications.
The test is going to be on a 3-year cycle. If you go to the graph that looks like this, what this shows is that in 1999 there will be three testing assessments going on at once. There will be the administration of the 1999 assessment, the field testing of 2000, and writing items for 2001.
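[Editorial sketch, not part of the spoken remarks.] The overlapping cycle just described can be illustrated in a few lines of Python. This is only a sketch under one reading of the remarks: it assumes each assessment has its items written two years ahead, is field-tested one year ahead, and is administered in its own year. The function name `activities_in` and the stage labels are hypothetical, chosen for illustration.

```python
# Illustrative sketch of the 3-year development cycle; all names are hypothetical.
# Assumption: the offsets below (years ahead of the current calendar year)
# are inferred from the remarks, not taken from any official schedule.
STAGE_OFFSETS = {
    "item writing": 2,     # items are written for the assessment two years out
    "field testing": 1,    # next year's assessment is field-tested
    "administration": 0,   # this year's assessment is administered
}


def activities_in(calendar_year):
    """Map each stage to the assessment year it serves in a given calendar year."""
    return {stage: calendar_year + offset for stage, offset in STAGE_OFFSETS.items()}


if __name__ == "__main__":
    for stage, assessment_year in activities_in(1999).items():
        print(f"1999: {stage} for the {assessment_year} assessment")
```

Calling `activities_in(1999)` pairs administration with 1999, field testing with 2000, and item writing with 2001, which matches the three concurrent activities named in the remarks.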
As you know, there's a little bit of a snag for the first assessment in that when the contract is awarded in September we would like to have had it awarded earlier in the year but the President didn't make his comments until February. So when the contract is awarded the contractor will have to do some catch-up, and hopefully they will be caught up by March of 1998, and then after that -- I think the work will be at a more leisurely pace once they get caught up, about March of 1998.
A few other things. The administration and scoring of the test. This in fact, is what you're going to be dealing with the rest of the meeting. What we want to do -- and I'm only going to briefly mention this right now -- but the idea here is that the Federal Government, through a contract, will make sure the test is developed properly and will stand behind, as I said, its technical integrity.
But the scoring, the analysis, and reporting is a local responsibility. So this is a test like any other test. If you go enter into a contract with a testing company, or you develop your own test as a State or district, if you don't have the internal ability or capacity to do it yourself, you contract it out. This is the same thing here.
This is a test that will be made available to you. If you as a district, don't have the capacity to let's say, score it and report on it, but maybe you are able to administer it with the proper training, then you would need to go to a company that can do that for you.
What we will have in this project will be licensed companies that will be licensed by a separate contract -- that's the one that we're working on now -- so that if you're a district and you want to administer the test, you would go to one of these licensed contractors, enter into an agreement with him, and they would do the scoring and reporting for you.
We have a commitment from the Department and given Congressional funding that the scoring sites and the companies that do this will be reimbursed in 1999 for their cost. And possibly in future years, again depending on Congressional appropriations. So that's the general way it will work, and we'll get involved in more of the details, the discussion of that later today, because that's sort of the whole reason you're here.
But that's sort of a different twist on this in that we're developing the test, we're making it available to districts and States. And in order for us to maintain the quality control that we like to have without being too restrictive, we're using this licensing mechanism as a way of guaranteeing to ourselves and to the public, that there's a level playing field; that it is being scored appropriately across various districts and States; and the results are being reported appropriately -- things like that.
If you look at the RFP and timelines, the first set of activities is in the item and test specifications, which I'll get into in just a moment. Those are being conducted now by the Council of Chief State School Officers and MPR -- which is a company located in California -- and those meetings have already begun. I'll talk about them in just a moment.
Right after the item and test specifications are complete -- which was planned for August -- then the test development contract will be awarded in September and the test development contract begins where the item and test specification stops.
What the item and test specifications are, they're like the blueprint for a house, so we need to have somebody develop the blueprint. The contractor in September is like the general contractor who is going to build the house, and they start with this blueprint.
The reason we wanted to have the item and test specifications developed now rather than waiting for September is, we didn't want to wait until September to start that work; we wanted to have that work done so that when the contract is awarded in September they can start the item writing and all the other activities that are associated with the actual test development.
Other awards that will be made soon will be a technical panel. We're trying to decide how we want to structure the technical panel. It will not be a Federal Advisory Committee. It will not be under the Federal Advisory Committee Act; it will be a technical group, probably associated with the contractors to do technical work.
There will be a separate linking contract that will be awarded in October to do the linking to NAEP and TIMSS. The evaluation -- this says October but as I said, we're trying to get this nailed down before October because we really would like to have the evaluators here to see a lot of the work that we're doing right now.
The licensing and certification contract, which is the one that we're working on at the moment, will be awarded in November; hopefully the first part of the month of November. Let me go now to the item and test specifications -- which is the work that's being done right now -- and you have another set of overheads on that.
The item and test specifications are being developed for the test by the Council of Chief State School Officers and MPR, and this is very consistent with the way that the test specifications were developed for the national assessment. Those turned out to also be done by the Council of Chief State School Officers, for both Reading and Mathematics. So it's a nice continuity from the NAEP project to the Voluntary National Test.
MR. FEUER: Gary, excuse me. Can you just tell people which handout they should be looking at?
MR. PHILIPS: Yes.
MR. FEUER: The cover says National State Panel and it's in there? Okay.
MR. PHILIPS: The item and test specifications -- the goal here is to take the NAEP framework which has already been developed, and to take the specifications for NAEP that have already been developed, and to modify those so that this test can be administered to individual students, and that we can get individual student data from the test. So whatever needs to be done to accomplish that goal, that's what the Chiefs are working on.
And some of the issues that they're looking at for example, is the content and coverage of the test. For example, the national assessment has a lot of content coverage in each administration. We have 90 minutes so we don't have the same total amount of time that NAEP has; because you know, NAEP may only give 45 minutes to each individual student, but across the whole system it might be giving 180 or 200 minutes.
So they're working on ways that we can still have the same content coverage and so that's one of the things, the content coverage: the mix of items, how to weight items, how the things should be sequenced on the test, what are the scoring procedures, the use of calculators in Mathematics, the number of passages in the reading. All those things are being discussed and looked at by the item and test specification committee.
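[Editorial sketch, not part of the spoken remarks.] The content-coverage constraint can be made concrete with a little arithmetic. The Python below uses only the approximate figures the speaker gives (45 minutes per NAEP student, a NAEP pool of roughly 180 to 200 minutes, and 90 minutes for the new test). Comparing the 90-minute test against the NAEP pool is an assumption made for illustration, since the size of the new test's own item pool is not stated. The function name `share_of_pool` is hypothetical.

```python
# Illustrative arithmetic only; figures are the speaker's rough estimates.
# Assumption: the new test is compared against the NAEP item pool, because
# the remarks do not state the size of the new test's own pool.

def share_of_pool(minutes_per_student, pool_minutes):
    """Fraction of the total item pool (in testing minutes) one student sees."""
    if pool_minutes <= 0:
        raise ValueError("pool_minutes must be positive")
    return minutes_per_student / pool_minutes


if __name__ == "__main__":
    for pool in (180, 200):
        naep = share_of_pool(45, pool)      # one NAEP student's booklet
        national = share_of_pool(90, pool)  # one student on the 90-minute test
        print(f"pool={pool} min: NAEP student sees {naep:.1%}, "
              f"90-minute test covers {national:.1%}")
```

Under these figures a single NAEP student sees roughly a quarter of the pool, while a 90-minute test given to every student could cover at most about half of it, which is why the panels have to decide how to preserve coverage within a single booklet.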
The committee is broken up into a number of groups. There's the national test panel -- and the members of that committee you have there. The meetings of the committee are at the bottom of each sheet. So you have the national test panel -- this is a sort of a policy-oriented group. And what will happen is, the Reading committee and the Math committee will report to them, and ultimately, the item and test specifications will be signed off by this overall, national test panel.
So there's a national test panel. There's already been an initial meeting of each of these. The national test panel has met once; the Reading and the Math and the Technical panel have all had one meeting. And there's another meeting, for example, next week of the Math panel along with a public hearing, in Denver, of the Math panel.
So I just wanted to make sure that you knew who the panel members were. There's also a Math committee chaired by John Dossey; the national test panel is chaired by Bill Cody; the Reading, Dorothy Strickland; and there really isn't a Chair of the Technical Advisory Group -- there eventually may be one.
So those are the people that are working, along with contractual support and commission papers, consultants, and things like that. These are the people that will complete the task of changing the item and test specifications. That work will be completed by August.
Now, I think you need to know, we are moving at a brisk pace but I don't believe we're rushing it. The item and test specifications that NAEP developed, they had basically about another month or so to do this, but they had to start from scratch.
This group is starting from the test specifications that NAEP has already developed, so it's just a matter of modifying what's already been there. So really, we are able to move, I think, quickly and still do a really good, thorough job, because we're able to build on the good work that the National Assessment Governing Board's already done.
The meetings of the item and test specification panels are all open, public meetings; you're invited. The announcements of those are on our Web site. There may or may not be a Press statement of some sort about them. They are public meetings. Each of them will have a transcript. The transcript will be on our Web site; within about a week of the meeting it appears on our Web site.
All the public meetings that we've had are on our Web site. This is a place -- we're using the Web as a way of disseminating information, announcing upcoming events, but also archiving information. If you go there you'll see in order, all the things that have happened: all the speeches, all the papers, these overheads. Everything that's publicly available is going to be on the Web site and we're going to continue doing that. So there's a complete chronology of everything that's happened.
In fact, one of the characteristics of this whole project is that it will be done in a fishbowl from the beginning to the end. When the contracts are awarded to develop the test, the linking, the licensing -- all of these are going to be public meetings; all will have transcripts; all will be on the Web.
The only thing that will not be public meetings will be of course, the writing of the items and the review of the items, because those will be kept secure. So we'll get a chance to see those items publicly. The first opportunity will be the sample test in preparation for 1999. There will likely be some items available as part of the item and test specifications, and then of course, the release of the items at the end of each Administration.
So that's the general plan, and if you have any questions I can take them. Or do you just want to go on? I don't know what the plan is. Do you want to ask a few questions? Yes?
MR. JENNING: Gary, you said materials on the test would be released. Your document says after the test but I think you said orally that materials could be before showing each of the students what was going to be tested.
MR. PHILIPS: There will be a sample test available prior to the administration in 1999. It will be an example of what the test is like; it won't be the test.
MR. JENNING: But it won't be the matter, the principles that are going to be tested?
MR. PHILIPS: Oh, yes. That will be available -- the framework is already a public document. The item and test specifications will be a public document. That will be available as well. So the blueprint will be available, and then prior to the administration in 1999, an actual example of the test will be available, along with scoring guides.
Then in '99, in March of '99, the actual test will be administered, and then that will be made available. And so every year there will be a new test that will be made available to the public, right after the administration.
MR. JENNING: So conceivably, a teacher could prepare students during the year for the test?
MR. PHILIPS: You can use the sample test for whatever you want to use it for.
MR. JENNING: A second question. Are you preparing any other test in languages besides Spanish in Math?
MR. PHILIPS: For 1999 we're only planning to do Spanish for Mathematics. In future years, we have to revisit that, but that's the plan for 1999. It is important to note that, you know, we can't do everything in 1999, and the goal is to do the most that we can. There will be incremental improvements and changes in future years as we learn more and, you know, and that sort of thing.
We're also building on the work that the National Assessment has already done. One of the reasons why we're really comfortable with doing a bilingual version of Mathematics is, NAEP has already field-tested a Spanish and a bilingual version and has used a bilingual version in Mathematics at the 8th Grade, so we can build on that. There has not been another language version of Reading in NAEP because reading is in English in NAEP.
MS. BYRD: Could you tell me what the government's purpose is -- at least at this point -- in adopting a national test? What's the kind of stated purpose that the government expects to accomplish with this?
MR. PHILIPS: Well, okay. I think you have to look at sort of a -- there has to be some context to this. You have to -- standards-based reform has been going on for over a decade. Many states have adopted standards -- content standards and performance standards -- and it's been, I think, a successful effort. There's been a lot of national attention to standards-based reform.
This is sort of like another step in that whole process. And what the goal here is, is to provide the same kind of information about content and national performance standards that policymakers have, down into the classroom.
Right now, when NAEP for example, or TIMSS produces this report -- which many people like and many policymakers make decisions as a result of it -- not a single teacher or parent or student has information about how they're doing on that test. And so the goal is to give them, simply the same -- this is an information activity -- it's to give them the same information that policymakers have.
And so that really is the whole purpose of it: simply to provide information down into the classroom level about how students are doing on a national standard and an international context. So that an individual student, starting in 1999, will know how they stack up against a criterion-referenced performance standard -- which are the ones that NAEP uses: basic, proficient, and advanced -- and they'll be able in Mathematics, to see how they stack up against students in other countries that have taken a Mathematics test.
MS. BYRD: Is it anticipated that this will affect the actual instruction in the classroom by giving it to the teachers? Do you know, is the Department going that far to make those kinds of projections as to the use --
MR. PHILIPS: Well, I would certainly hope that when teachers get better and more information that they would make use of it, and that sure, absolutely.
MR. KNOTT: What is the role of the States in this? You've referred several times to districts, about the districts contracting with licensed contractors and so on. What is the role for State governments in the implementation of --
MR. PHILIPS: Well, this again -- this is a test like any other test. A State can adopt a test from a norm-referenced testing company, they can develop their own test. This is a test that will be made available to States -- six States in fact, have already agreed that they want to use the test, so they're just like a district.
So yes, States will -- we expect States to use it, and it's completely voluntary. You know, if a State or a district decides this is not for them, that's great; it's not for them. This is a voluntary test. If they want to use it they can, and we're trying to make it, you know, something that's useful to them.
MR. HEUBERT: We're going to be discussing appropriateness in a number of uses of the tests. Before we get into any of the specifics though, is it the Department's position that it will do what it can to prevent inappropriate use of test results, or to promote their appropriate use? Or is the government's position that once we administer the test the use of it is a matter for State and Local educators and officials to decide?
MR. PHILIPS: As part of the administration of the test in 1999, there will be a number of guidelines which again, will be developed over the next several months -- well, actually not -- it will be, once the award is made in September there will be guidelines for test utilization -- which I think is what you were getting at -- and of course we will deal with the high-stakes nature of the information.
And we fortunately will have the benefit of the Code of Fair Testing Practice and the joint technical standards to help us think through that. And there will also be guidelines on reporting and other types of guidelines. And so the level of specificity I don't know yet because that hasn't happened, but the tests would be administered and utilized within the parameters set by those guidelines.
MS. BAILEY: Could you clarify your earlier point about the voluntary nature of the test? In the case that a State chooses to offer the test, does the Federal Government have a position that once within the State, whether an LEA has an option to opt out, and who can make that ultimate decision?
MR. PHILIPS: Well, that's I think, a State decision, just like the State right now. If it adopts the test and the LEA refuses to do it, that's something the States are going to have to deal with. Now there will be, again, guidelines for reporting, so when it comes to reporting, we're likely to have some guidelines here -- I don't know what they are yet -- so that we are assured, and the public is assured that reporting is done in a valid way and that it's not misleading, and things like that.
MS. BAILEY: So your earlier comment about local district choice, that choice is available if the State would not be offering the test? I mean, that's the only voluntary option that the district has if the State has decided it's mandatory? I mean, I just want to clarify that.
MR. PHILIPS: The same situation holds with this test as it would if a State decided to adopt a norm-referenced test and the district said, I'm not going to do it. That has to be dealt with internally within that State, so that's not really an issue for us, I don't think, to get involved in.
CHAIRMAN EDLEY: Could we get people to please say their names -- that was Adrienne Bailey -- for purposes of the transcript.
MR. TAYLOR: Bill Taylor. The President has said he's against social promotions. What role if any, does the test have in furthering his view that social promotions are a bad thing?
MR. PHILIPS: I really -- I don't think I should comment. I mean, the President has many policy directions on a variety of topics. If you don't mind I'd like to stick to the test and --
MR. TAYLOR: Well, I'm not asking about -- your comment about whether you're for or against the President's policy. I'm asking you what role if any, the test has in --
MR. PHILIPS: This test --
MR. TAYLOR: -- the Federal plan in dealing with questions --
MR. PHILIPS: This test, like any other test, is information that districts and States use for a variety of purposes. So there's nothing special or different about this test.
MR. MADAUS: George Madaus. You said that there would be no Federal monies linked to this. What's the relationship between this test once it gets going, and Title I?
MR. PHILIPS: The use of this test for Title I, I think, is a Title I issue. So that has to be dealt with -- again, this test, George, is like a test that you might want to use: a norm-referenced test or a State test or a local test, this test.
There is nothing different about this test, so if this test is used for Title I then it's used in all -- then the law about Title I applies to this test.
MR. MADAUS: Then in future reauthorizations of Title I, would there be a firewall that would protect the test from being used as an evaluative mechanism? Because if it is linked to Title I, it's not voluntary any more.
MR. PHILIPS: Again, if you don't mind, I would like to restrict my comments to this test and not the Title I program and future policy directions of that program.
MR. SHANNON: John Shannon. Have any other States indicated to date that they would not go along with the testing?
MR. PHILIPS: As far as I know, no State has indicated they will not. Six States have indicated that they will, and several -- many, actually -- others are in various stages of making a decision.
PARTICIPANT: Larry -- at one point -- plans to reimburse the -- or whoever is administering the test -- is that still in the plan?
MR. PHILIPS: That's still in the plan for 1999, and depending on, I'm assuming on the success of that, and of course, Congressional appropriations, that may happen in the future as well. But for 1999, it's still the plan. But that does require Congressional appropriations.
CHAIRMAN EDLEY: If folks in the audience can either speak into a microphone or really boom it out so that it can be picked up for the transcript, please.
MS. LEWIS: Gary, does that reimbursement cost include costs for professional development of the administration of the test?
CHAIRMAN EDLEY: I'm sorry, and the name, please?
MS. LEWIS: Sharon Lewis.
MR. PHILIPS: That cost would include reimbursement for the administration, the printing, the scoring, and reporting.
MS. LEWIS: Professional development, the training of teachers to administer the test?
MR. PHILIPS: Training is part of administration, right. And it's likely to be a fixed cost, although of course, the whole reimbursement -- part of the RFP for the licensing is to work through with us, the details of that reimbursement procedure.
CHAIRMAN EDLEY: Just a couple of more questions and then we'll take a quick break. We have guards posted at the door to keep Gary from leaving, so we'll be able to continue asking him questions through the day.
MR. DUNBAR: Steve Dunbar, University of Iowa. Gary, you mentioned the purpose of this program is to provide information about individuals to parents and teachers. Are there plans, provisions, guidelines, that you have thought about in the area of aggregated reports, either at Local, State, or national levels?
MR. PHILIPS: Yes, and this is an ongoing -- we are currently having many discussions about this. Ultimately, it will result -- there will be, as I said, a set of reporting guidelines, and the type of aggregations, levels of aggregations, and just how much we want to weigh in on that will be there. I don't have that yet, today. But I can assure you, we're talking about that quite a bit.
MR. DUNBAR: One quick follow-up. Is there any anticipation of a national report of performance on these things?
MR. PHILIPS: At the present time there is no such plan. However, let me say that as part of the field testing -- in 1998, for example, when we field test the forms, we will have national data from a national probability sample.
It will not be a large one -- it will not be like NAEP, so you can't do all the stuff you do with NAEP -- but there will be some information at a national level, on the various forms of the test, and each year as we're doing the field testing we'll continue to do that.
But we do not plan on collecting information from districts and States and adding it up and getting a national estimate. That won't happen. If you want that, that's exactly what NAEP does very well, and will continue to do that. We don't want this test to duplicate what NAEP is already doing a very good job of, and has been doing for 25 years.
MR. HAKUTA: Kenji Hakuta. How do you resolve the apparent contradiction between the voluntary nature of the test and the requirements that included or excluded special --
MR. PHILIPS: What do you mean? I'm missing the --
MR. HAKUTA: Well, one of the points here is that you'll be requiring -- you have required criteria for inclusion and exclusion of special populations.
MR. PHILIPS: No, we're not requiring. What we're doing, what we're saying is that we don't want to -- we do not want to have certain populations excluded simply because they're members of that special population.
MR. HAKUTA: Right. Under the equitable design point, the bullet says inclusion criteria and appropriate accommodations will be required. If we --
MR. PHILIPS: It will be required of districts -- what that means is that district -- maybe I should rephrase that. Those that administer the test will be required to use inclusion criteria and to use various accommodations -- those that we agree to -- and so, that's what that means. It doesn't mean that a student has to take the test. That's not what I was intending there.
CHAIRMAN EDLEY: So if I can just jump in to clarify. So there obviously are contemplated, at least some regulatory -- I use the term loosely -- constraints on the way in which the test is used.
MR. PHILIPS: There will likely be some -- yes. Regulatory is not the word. There will be some -- for example --
CHAIRMAN EDLEY: Operational constraints?
MR. PHILIPS: Well for example, it might be that if a student, a blind student, there may be Braille/enlarged print versions available. And depending on the degree, the decisions made by the IEP, if you administer this test for certain students, there needs to be a version, Braille/enlarged print. I'm not saying that those will be the accommodations because that has to still be worked out, but there will be accommodations.
CHAIRMAN EDLEY: I'm absolutely certain we will return to this subject a little later in the morning. I think we ought to -- Gary, why don't we cut it off there, to be continued? And if we can take a 4 minute, 59 second break, for people to get more drugs out in the lobby?
(Whereupon, the foregoing matter went off
the record at 9:40 a.m. and went back on
the record at 9:45 a.m.)
CHAIRMAN EDLEY: Dick Elmore, Professor Richard Elmore from Harvard's Graduate School of Education just joined us. Just to refresh your recollection, the goal in this next chunk of the discussion is to try to get out on the table some of the particular sorts of risks, based on historical experience with other large-scale assessments, that might be identified -- anticipated if you will -- in implementation of the President's testing initiative.
So my hope is that through the presentations and in the Q&A, you will all start to assemble a short or long, as the case may be, list of concerns -- questions, issues -- that we would want to see addressed in the design and implementation of the program.
And we have an extraordinary group assembled to help us with that. You have some biographical information in your booklets so I won't belabor that. We've asked each of the presenters to hold their remarks to 10 to 15 minutes, and then we have discussants as well.
And I guess we'll start with Eva Baker who's a professor in the Division of Psychological Studies and Education, and the Division of Social Research Methodology, and Acting Dean of the Graduate School of Education and Information Studies at UCLA. Eva?
MS. BAKER: Thank you. I have even more impressive titles but we'll share those later. Thank you very much for inviting me; I'm honored to make this presentation to the Board. And I have to say that Gary's presentations have evolved from each time I hear them, so some of the comments he has made have caught me up a little short, and some of the questions that you have raised have anticipated concerns that I have.
However, I'm not going to be extremely flexible because I took the red-eye and I'm more likely to stick to my prepared remarks than I normally would be.
Let me simply say that we're all together here because we acknowledge the importance and the complexity of this undertaking, and because we believe that determining what and how well children learn is logically essential to promoting their intellectual growth in an optimal way.
With the right testing systems, students and parents should be able to benefit from feedback, and teachers and administrators could assess responsibility and reset their instructional priorities, and in the longer view, the public could better appreciate the goals and progress of educational systems.
But of course as we know, for many reasons, large-scale testing programs have been criticized for failing to serve well their various constituencies. Test administration and interpretation processes are alleged by sizable numbers, to constrain, warp, underrepresent, overindicate, and generally mechanize education.
Nonetheless, test results and particularly, test results in a comparative framework, have world-wide credibility in public policy and for most parents as well. So I think our problem as a group is to think hard about what we can do to reduce the negatives and to improve the role tests play in our practices.
The general principle that I'm going to advocate -- it's going to be woven through and it may not be as explicit as you would like -- but the general principle that I want you to think about is, how can we remove as much as possible, the incentives for misuse in the national testing system?
I will make some suggestions that might be acceptable, but I'm more interested in your thinking along those lines. My able and charming assistant -- fabulous.
In any consideration of large-scale testing, the frame of discussion normally encompasses at least some of the test purposes listed on the slide, and the extent to which a different or unanticipated use misleads us in our interpretation of results.
We consider here two types of issues. One is the quality of the test for the purpose it is attempting to serve, and that goes by the name of validity, or validity of inference, in general. And then secondly, the specific kinds of errors brought in by extending a test purpose beyond the original design.
At the heart of the discussion, again, is the validity of judgments we make from test findings, and with either sort of analysis -- either focusing on a test purpose and its validity for that purpose, or the extension idea -- there are a number of points in the chain of testing events where errors can occur.
And those include: the point of communication of purpose or purposes; the understanding taken from those statements; test administration practices, scoring, reporting, and inference-related actions.
We may make inappropriate inferences because of a mismatch of purpose, the technical design characteristics of the examinations, and/or the conditions under which particular respondents are actually involved in the testing.
Interpretation errors may apply to the entire set of results, or may be focused on the fairness of interpretation for subgroups of students -- and I know some of my colleagues are going to be addressing that issue particularly. I mean, an example is: what about the history that particular subgroups might have; think about their instructional backgrounds; think about Debra P. as an example.
Similarly, misuse analyses often focus on reporting, interpretation, and inappropriate consequences ensuing from adapting tests to purposes for which they were not designed. But before I read for you some of the -- I think it's in our general lore about what test misuses are likely to occur -- I want to raise three linked perspectives that I regard as underlying realities, that will I hope, inform the means we use to select and to promote the best use of tests.
First is the exploration of the concept of control as it seems to underlie the thrust of this conference, at least by its title, and to some degree, the model of the proposed national tests themselves as I understood them. I'm not sure as I understand them this moment.
Regulation of practice, in fact, the acceptance of purpose -- that is, the government's purpose as a context for appropriate use -- assumes that there is an optimum purpose held by an acknowledged authority, then implicitly sets the boundaries for desirable, acceptable, and unacceptable uses.
Without dropping too rapidly into the intellectual morass of multiculturalism and deconstructed meaning, let me simply assert that test purpose resides in the hearts and minds of the beholders. If a common test purpose is to be accepted by a wide range of users, that I do not believe, can be accomplished by dictum and regulation. It's a communication challenge, one in this case exacerbated, I believe, by an unrelenting schedule.
Second, there is a continuum of adequacy about test use, ranging from something that approaches perfect kinds of validity inferences where the purpose, technical quality of the measure, scoring, administration, inferences yielded, line up pretty well to some where there are logical extensions or inferences made, or other applications of data made, to some where there are clearly deliberate misuses.
Deviations then, from the intended purposes, need to be considered I think, from two perspectives. One is the intentionality of the people who are doing it; that is, what's going on and why do they believe this is appropriate.
But more searchingly, I think, from the perspective of the damage done: on the one hand to individuals, and on the other to public understanding of the educational enterprise. Balancing the consequences to individuals and to the larger enterprise creates a tension we've experienced before, and I believe we do not yet know how to resolve.
Third -- and this is the one I think is the best -- in this country we value exploration and innovation. The concept of use control and the measurement specialist's well documented arguments that certain tests should be used solely for certain purposes, flies directly in the face of a powerful legacy of tool development.
Since our prehistoric days, humans have learned and been rewarded for creating an object and testing its broader applicability in areas outside its original purpose. Tom Glenn called this propensity Technology Push almost three decades ago, when describing how new technical applications were conceived and developed.
My favorite -- although unfortunately dated example, is the creation of masking tape, an innovation that led to transparent tape to use for paper repair, color tape for decoration, and -- there's nobody in this room as old as I am, but -- tape for hairstyling -- don't nod; you don't have to reveal yourself -- and for carpet tacking, and so on.
So it's the nature of people to look at a creation, and especially in times of scarce resources, to find other reasonable ways to apply it to meet another important end. Testing, I think, is not immune from this human propensity.
So the pervasive search for this generalized application -- sometimes focused, sometimes opportunistic -- suggests to me that an idea, purpose migration, extending the use of a test to another, somewhat related need, shouldn't continue to be the annual surprise regarded with despair, but should be considered and anticipated. It shouldn't be an unanticipated outcome; it should be understood that it's going to happen.
Instead of bemoaning misuse it is our job, I think, to harness this propensity. And if not warmly welcomed, at least more than one potential use should be sketched out in a risk analysis provided for making inferences from classes of unintended uses. And maybe Gary's comments suggest that that's something that's on people's minds already.
It's my view that particular strategies of test design can also actually help us optimize use for a broader range of purposes, but even if I'm only partially correct, I think costs will be more broadly amortized and acceptability might grow.
Initial remarks made -- let me talk about historically, two major kinds of unanticipated use problems. Actually the first will be focused on the specific migration of test purposes and what happens there, and the second will be an administration -- more forward-looking issue in the administration of these tests that deal with the security issue.
Throughout, I'll try and indicate, or at least stimulate your thoughts, about possible solutions in this area. So let's go back to the purpose chart. I think I had two of them in there, Rich.
To understand purpose migration, let's look at the chart. Look at column A, column B. For example, it's easy to see that many column A purposes based on individual student data can, to some degree, be aggregated to meet institutional purposes in column B.
For example, reporting frequencies or trends of students who receive certification such as diplomas, students who are promoted, or students who need remedial instruction, can be used to make inferences that are program evaluation inferences, or system monitoring inferences.
In each of these extensions of a test, from one kind of purpose to another, concern for the details of context have to be acknowledged. Fewer students may be placed in remedial programs because budgets were cut and not because performance was raised. SAT scores may be higher because of background characteristics rather than efficacy of a school program. We know this.
Considering this concept of purpose migration from within column A, I think great errors could be made in taking assessments designed as placement tests, for instance, and moving them over to certification.
Because obviously, the test content may be inappropriate and the degree of certainty that one would need for irrevocable kinds of decision would need to be higher for certain kinds of uses; that is, if we were actually certifying somebody, letting them go out, as opposed to placing people into a program where there's an opportunity to regroup if we made an error.
But I think -- and here's where I'm a little confused -- but I think the case-in-point here is using a test that -- Gary talked about it today in a way that I hadn't heard -- and maybe it's my understanding of this that is different. I really thought of this test as principally a system-monitoring test that was given at a census level, and not as an individual test provided for communication and motivation in the column A.
Now, that may be wrong-think, but let me continue and say that, if we were thinking about this test as a test that was going to be given to everybody in a system -- let's say a State agrees or a district agrees or even, I believe that there will be enormous pressures, Gary, to aggregate and report whatever data we have at some sort of higher level -- national reporting.
The question really is, what happens, as some of the participants anticipated, if States wish to use these kinds of measures for other higher stakes purposes? There's obviously a history of attempting to use these tests for system monitoring, such as State assessment tests being used to make decisions about the effectiveness of educational administrators.
And the lesson is -- and I think Dan Koretz and others have said this quite well -- that stakes become attached to tests any time broad-based public reporting occurs, whether or not that's the purpose intended by the promulgators of the test. And of course, sometimes interpretive errors will occur.
My favorite example is from Leigh Burstein's work based on a California school district that reassigned principals based on changes in performance on the State assessment, when in fact, the real changes were due to in-migration of different kids coming to different districts. It had nothing to do with the principals', you know, propensity to be an instructional leader, but all these people were moved around because of that.
A second type of purpose migration -- I'm almost done -- involves the extent to which a test, created for broad system monitoring, is appropriately used to make student retention/promotion decisions. In the present plan, I believe this is considered as a type of misuse. It would undoubtedly result in relatively unreliable classifications of students into promotion and retention categories.
And even if the test were of such length to permit adequate classification of students, such uses would also require, I think, that the measures be closely connected to curricular offerings in the particular local setting to assure that decisions about students were made on instructionally-relevant grounds.
The next slide -- we should recognize that an assessment designed to provide student level data and perhaps also to provide system monitoring, undeniably creates an expectation for improvement in all participating districts and States. When expectations are raised, no matter what the nominal purpose of the test is, pressure for improvement occurs, perhaps specifically linked sometimes to job performance goals for Superintendents and Administrators.
If no reasonable avenue is provided for instructional improvement -- such as empirical guidance, clear and plausible choices for teacher action, or strategies for teacher preparation -- systems people have fallen back on their logical options. Students have been urged to practice test formats in the hope that they will be able to raise their scores, rather than given instruction in the content and concepts underlying the items selected.
Curriculum narrows and increased time is spent on test preparation as a separate event, sort of apart from the regular curricular and instructional focus of the schools. The test works out to become a barrier, an enemy, something that pulls time away from reality, rather than a neutral benchmark.
The design of any large-scale test should take into account the certainty of this intention to improve scores and help educators to find the balance between appropriate focus and the dysfunctional narrowing of attention.
Releasing items may simply exacerbate the problem, reinforcing some notion that the best way to do this is item-by-item practice. So I applaud the idea of the search for high quality, understandable, and concrete test specifications. If they're done right they can provide some bridges to show how the underlying constructs relate to State, district, and teacher goals, as well as legitimate instructional changes.
Precepts may guide us, reinforcing the expectation that validity evidence will be needed for every new purpose. That will be something I'll allude to briefly when I talk about the standards. But in practice, the question really will arise about who or what groups have the capability to provide such evidence, entwined with the reality of rapid and unexpected adaptations of uses of tests.
So while we have sort of rules and guidance about how we make validity arguments for different purposes, I think the reality sometimes gets ahead of us.
This next slide says, Test Administration Tech Challenges. To return to the topic of control and practicing items, the question of test security is on my mind. Let me raise two concerns, and I'll do it very briefly. The first is that even within certain school districts, instructional schedules vary considerably.
If it were intended for instance, to test all students in year-round schools in Los Angeles, a fixed testing window might very well result in great numbers of students missing the test, and the question is how make-ups would be handled, or if they would.
My second concern is far more pervasive. It involves a rapid obsolescence of the idea of centralized control in a world that is becoming far more accustomed to information access and its distribution. I've talked in the last month with a lot of people about the extent to which we believe, or it is believed, that security can be maintained in the era of the Internet, given just the numbers of people who have access to this examination. And my guess is that it's functionally certain there will be a breach of test security.
Just for a moment, even if you believe that, what would you do about it; what's the backup plan? My suggestion is probably one that -- I won't make a comment. I propose dropping the concept of test security entirely and instead, create specifications and release the large set of items in advance to all students in schools.
If the set is large enough and the specifications are clear enough in guiding, then practicing individual items won't be seen to be the optimum strategy for learning these kinds of skills, and public release would of course, up the pressure on the providers to assure that a high quality match between specifications and items occurs.
I won't go on about that, but I think that is really an important consideration for you. And if that's not the strategy you adopt I think that the message of that is that you're -- what I'm trying to do is to find a way to create a disincentive for cheating.
The larger comment that I have to make -- I'll stop -- but it really does have to do with, how do we help people understand -- and I know this isn't the topic -- but understand the relationship of these tests to NAEP, to what they've been doing in standards-based test development, commercial tests. How do we help them resolve different information?
And it seems to me that that's something that's extraordinarily important for us to do if we want to demonstrate that this assessment has value-added, over and above what is already going on in the system. I think that we have to find a way to sort through all of these needs and test interpretations, and I think the struggle is worth undertaking. Thank you very much.
CHAIRMAN EDLEY: Thank you. We clearly made a mistake, given Eva's background and her role as co-Chair of the Joint Committee on Standards for Educational and Psychological Testing and so forth. I think we should have just arranged to spend half a day with Eva. It would have been time well spent, of course. And it's really Michael Feuer -- Michael Feuer is to blame for not organizing this in that way.
Next up is Richard Duran who is a Professor in the Graduate School of Education at UC-Santa Barbara, and is a member of BOTA. His fields of expertise include assessment and instruction of language minority students, and design and evaluation of interventions assisting language minority students. Richard?
MR. DURAN: Okay. Thank you, Chris. My remarks will focus on a variety of issues surrounding exclusion and inclusion of students with disabilities and limited English proficient students in the new assessments. I'm going to focus in particular, on some of the language of the RFP in trying to get us to understand the possibilities that are there and some of the challenges, but I will move into a discussion about the purposes of the assessment and the connections of inclusion to what research knowledge tells us might be possible with examinations that might be examined in their nature.
I will not be speaking about legal requirements surrounding the need to include these students, but note that these legal and statutory mandates ought to be interpreted in a manner making best use of contemporary measurement theory, and research and policy analysis regarding fair and effective measurements of students with disabilities and LEP students. And we have plenty of experts here today that can help us with the legal origins of inclusion in terms of how it's represented within the education system currently.
So a typical cut through these issues would examine issues of inclusion, would look at validity, reliability, comparability of scores of students who receive accommodations in testing, and the issues of fairness in testing in terms of being able to actually assess what students are capable of.
But I'm going to weave the discussion of these issues more so that it's oriented to the actual language of the RFP so that we can begin to work at clarifying some of the questions in the actual design that's unfolding.
One of the goals of making the new assessments inclusive is to permit inferences about the achievement levels in Reading and Math of all students -- all students. It is recognized that exclusion of students with disabilities and limited English-proficient students from large-scale assessments has distorted the validity of large-scale assessments such as NAEP -- at least allegedly, when we asked the question carefully.
Achievement levels are most likely higher when students with disabilities or LEP students are omitted from assessments -- and some of the work of the NAE has suggested this. Maximally including students with disabilities and LEP in the new assessments would increase the capability of policymakers, educators, and the public, to make accurate inferences about the performance levels of all children, the schools, districts, and States; subject to the caveat that we need to ask questions about the history of the students and ask whether the assessments are appropriate given their academic history.
Further, maximal inclusion of students would send the message to students, teachers, parents, and others, that education change based on assessments is seeking the same improvements in achievement for all students. Increasing inclusion of students with disabilities and LEP students in the new assessments will require accommodations in the instrumentation and administration of new assessments.
I note that the definition of assessment accommodations is -- there are many definitions out there. But a general definition that we would use is, non-standard forms of test administration or responding on tests. So a deviation from a standard version that is meant to allow students to show their ability.
So specific accommodations to be available in the new assessments are stated as follows. Students with disabilities -- the deliverables include Braille and large print on both Reading and Mathematics assessments for blind students and students with limited vision.
And for LEP students, English audio cassette version of an examination plus for Spanish LEPs, a bilingual Spanish/English version of the Mathematics examination. I need a little bit of clarification on the audio cassette version -- whether that's just going to be for the Reading or for the Math.
Now, it is noteworthy that the RFP then, for the new examination, specified deliverable accommodations for students with disabilities only for visually impaired students, and not for other categories of disabled students who form the bulk of students categorized as students with disabilities.
One must keep in mind that other assessment accommodations will be implemented for students with different disabilities, and indeed, must be implemented if so stipulated in the disabled student's IEP Plan, or by State regulation. Indeed, other forms of assessment accommodation are mentioned and intended in the design statement of the new assessments.
For example, on page 23 of the RFP in Appendix A, mention is made that it is expected that reasonable accommodations for students with disabilities or with limited English proficiency, will be provided by the school administrator. And on the top of page 19 of the RFP, the section on task 17 mentions some further details on that.
The contractor shall conduct ongoing research into the reliability and validity of the national tests. A number of issues have already been identified in order of importance, and others will arise over the course of this contract.
Validity of test scores under non-standard test conditions; that is, the impact of testing accommodations -- Braille, large print, extra time, one-on-one testing, etc. -- upon the validity of test scores and the feasibility of developing glossaries for use in the national test of mathematics for native languages other than English. That's on the table.
Now what I'd like to do is to skip the next transparency from the two following ones, the first. So mention is made then, of the need for research on validity of test scores under a variety of conditions. Now a recent report that has come out -- that I'll reference to you more carefully, by Jon Olson and Arnie Goldstein from NCES -- drew on another study looking at the kinds of -- on 22 States -- of the kinds of accommodations that disabled -- students with disabilities receive in testing.
The relative frequency of these varies quite a bit. Some of these are very rare; others are found more. But this gives you an idea of some of the variation that's going on out there, and so when we look at what accommodations mean, and we look at how they might influence test scores, then there's quite a bit of stuff going on out there, and a lot of that is not necessarily going to be under the control of the examination system.
If you'll go to the next transparency you'll see a list of accommodations that are also provided for LEP students. Now, you'll see some overlap of course, with some other accommodations that are provided students with disabilities, but these have -- in a survey that was done by NCREL, they found that among 22 States, these were some of the accommodations that were given LEP students.
Okay, so that's kind of under control of the education system and part outside of this testing system, but raises significant challenges for thinking through how accommodations might affect test scores.
Now, if you put on the transparency we skipped -- thank you. A new NCES report by Jon Olson and Arnold Goldstein is an invaluable source for analyzing the foregoing issues on both policy grounds and research. The report reviews ongoing and previous research by NCES and NAEP contractors on NAEP, recent OERI-sponsored State-based studies, including studies by CCSSO and CRESST, and studies by the College Board and ETS, among other agencies on these issues.
This is a landmark document in terms of pulling together what the issues are for large-scale assessment. It's a very useful resource.
One of the main outcomes of previous research that's cited in this report has been that allowing accommodations can make an assessment easier for all examinees, regardless of students' disabilities or language status. But this finding is not uniform.
Research on the effects of accommodations on performance on the new Reading and Math test will need to contend with the mix and variation in the use of different accommodations among students with different disabilities and among LEP students. I mean, it's going to be an analytic dilemma, about how precise to be in terms of looking at how performance might change.
Further, the effects of accommodations will need to be studied among students who would otherwise not be allowed accommodation, in order to determine where the patterns of accommodation change a measurement target. So there is a lot of ongoing research in experimental design research looking at whether accommodations help students who would not be labeled as students with disabilities or LEP students, and what this means.
The new assessments are presumably intended to be power tests rather than speeded tests. Power tests are intended to gather valid and reliable information about students' maximal proficiency in a subject matter area.
If extra time is found, for example, to improve the performance of all examinees substantially, and not just the performance of students with disabilities or LEP students, then it would seem very important to analyze the constructs being targeted for assessment by the new exams. Are they appropriate constructs, if they show speeded effects? Another alternative would be to build the notion of speededness into the construct, and I would not dismiss that possibility, based on cognitive research that shows that speed of processing and ability to make verbal associations are very related in terms of performance on verbal ability tests.
Design of studies to investigate the effects of accommodations will be a considerable challenge. It would seem useful, obviously, to begin carrying them out while the exams are still in the pilot administration phase, and that's the situation here -- at least it's targeted.
These studies will be very difficult because States and LEAs, as we've seen, vary in how they implement the definition of students with disabilities and limited English proficiency. This is a big problem. What do those categories mean? There's a lot of variance in the operationalization of the definition of disabilities and in how states characterize limited English proficiency.
Studies of this kind will be made further complex by the need to evaluate commonality in the actual criteria used by local assessment administrators to exclude students who are judged as incapable of being examined. And also, the actual procedures used to assign students to accommodated versus non-accommodated examinations.
And here, what I'm referring to is the difference between saying this is the way you ought to do it, and what really happens. That can be extremely noisy and is a very important issue to investigate with these new exams.
The use of an audio recorded Mathematics test and Spanish/English Mathematics examination raises special questions. More details on the precise administration procedures and materials for these accommodations are needed, though some important procedural decisions have been made. And these are policy decisions and you know, one needs to think carefully about what they mean in terms of what we know from research.
In the current specifications, LEP students with more than three years of academic instruction will be asked to take the national test of Reading and Mathematics in English. LEP students with less than three years of academic instruction in English would be given the English Reading and Mathematics assessment, unless school staff judge them as incapable of assessment in English. Criteria for this latter judgment are in need of elaboration.
Another problem to be faced is test item development procedures for handling socio-linguistic variation in Spanish -- a notorious problem to developers of assessments in Spanish. Exactly how will this be handled?
One obvious strategy is to enact an item review process that catches and edits terminology or phrasing in Spanish that would not be recognized universally by competent, native language speakers of Spanish. We strive for that, but it's hard to attain.
Scoring of LEP students' short response performance items in Mathematics raises an important issue. Will scorers be trained to focus on students' mastery of intended knowledge or problem-solving skill, given limitations students might show in English language proficiency?
Research on the former CLAS assessment in California suggests that non-English background children's writing may convey evidence of students' mastery of subject matter knowledge, despite infelicities in English, and that scorers might be trained to be sensitive to appropriateness in written content despite children's limited familiarity with English.
These are some of the underlying questions that cut across use of accommodations with respect to inclusion. I'm not going to go into these in detail, but they are issues that need more elaborated discussion.
In order to control my time, I'm going to move ahead and I'm going to talk about a couple of controversial points. We mentioned a little bit about the Spanish and English Math exam, the criteria for use, the development of Spanish translation, the training of scorers.
Now, I'm going to bring up something that is an example where, you know, researchers in the field, looking at what Reading is -- what the development of Reading is among bilinguals. Now here I cite the recent NRC report, Improving Schooling for Language Minority Students -- a research agenda that was published this year.
In terms of dealing with inclusion -- and as Eva and I just had an exchange -- what do we do with a school district like L.A. on the Reading exam? What are we doing? Are we understanding the distribution of concentration of students with different language characteristics and how we have to contend with that in terms of actually getting at a kind of grassroots understanding of what achievement is?
Now, reports such as the NRC report on Improving Schooling for Language Minority Students, I think, cite plenty of evidence that the skills students develop in Reading and language are transferable into a second language.
Now, like any area of research there are controversies about this, but there is a fair amount of consistency about this point, and it's one that bilingual education researchers have made over and over and over as a working hypothesis that seems to be difficult to challenge; that's a very good way of posing education change for students in terms of developing students in terms of the resources that they're capable of managing.
It's my personal opinion -- not representing BOTA or NRC -- that we still face issues of inclusion that are not dealt with well in this examination system, and that adding the possibility of an examination in Spanish in the area of Reading might be an example of a good development that would help lead to a better understanding of what students can do.
I'm not talking about knowing English when you leave high school; I'm talking about what you're doing in 4th Grade Reading as a foundation for being able to deal with text. So that's an interesting question for me.
In conclusion, I want to raise one other issue that's going to come up later on -- certainly, John Fremer's going to deal with this, and Eva in terms of the Standards. I'm not sure how responsibility for inclusion is going to be distributed across -- and responsibility for analyzing what accommodations mean and how they influence test scores -- across different agencies. It's blurry to me.
If we look at the JCTP documents on guidelines for fair and equitable testing it's clear that we can assign responsibility across just about every agency that has something to do with the development, administration, and use of tests.
But I think that in the area of inclusion and the use of accommodations, those responsibilities have to be sharpened and it has to be clear exactly what's going to happen. And there's a potential here that this could be a very litigious matter if it's not dealt with properly.
One closing comment and that is, I haven't here addressed in any depth, issues of the academic history of students in the appropriateness of tests. I've taken the tests as a given, but I think that there are other issues to pursue that deal with what inclusion means.
If students with disabilities and LEP students tend to perform lower as our data indicates so much, then I think we need to deal with appropriateness of the tests in terms of their achievement proficiency, given where they're at.
And that's a very basic issue that needs a lot more attention in order to really get at the heart of what these assessments are supposed to be doing in terms of providing information for improving educational outcomes. Thank you.
CHAIRMAN EDLEY: Richard, thank you very much for that. Our next presenter is another Richard. Richard Jaeger is the Excellence Foundation professor in the School of Education at the University of North Carolina, Greensboro. I, myself, am the Fairly Adequate Foundation professor of Law.
His fields of expertise include educational research methods, educational measurements, standard setting and performance assessments, teacher certification, and the understanding and use of test results by policymakers and others. Professor Jaeger.
MR. JAEGER: Well, I'd really looked forward to the opportunity to hassle Rich Shavelson about handling my overheads because I have only one, but the timing is critical and now it's been blown by Gary Philips. But I too, will stick very closely to my prepared remarks because of the time limitations.
One could argue that Voluntary National Tests in 4th Grade Reading and 8th Grade Mathematics are like any other standardized tests adopted by States or school systems to assess their students' achievements. Indeed, that argument was put forth a number of times during public meetings on the national tests held on March 4th, March 26th, and May 19th, and again, here today.
However, tests that carry the Federal imprimatur and serve the catalytic objectives envisioned by the President, the Secretary, and the Deputy Secretary, cannot be like any other. Their very purpose invokes burdens of fairness, precision and validity that surpass those imposed on tests used solely for description or pulse-taking.
The consequential side of the matrix weighs heavily on a national test and the strategies and procedures used for reporting test results warrant particular scrutiny.
During the public meetings mentioned earlier, Mike Smith and Gary Philips identified reporting to parents and teachers as the central goals of the national testing program. Four challenges must be met with both groups, and those are the challenges that are on the overheads here.
Parents and teachers must be motivated to consider the results of national testing. The results of national testing must be presented in ways that parents and teachers can readily understand. Parents and teachers must be convinced that the results of national testing should be valued. Test results must be communicated to parents and teachers in ways that foster valid interpretations and inferences.
Although the proposed national tests and NAEP differ in important ways, the NAEP experience must be considered. The issue of audience motivation has plagued NAEP since its inception in 1968 when a professional journalist was employed in an attempt to develop interesting copy for major newspapers.
Granted, parents should be more interested in the test performances of their own children than in the distribution of achievement scores for their State or the nation. But the parents of children most at risk of failing to read or solve challenging mathematics problems are those least likely to attend to their children's test scores.
As a former teacher in inner-city New York testified at one of the public meetings, parents there typically did not respond to requests to sign and return their children's report cards.
The major reporting challenge is finding ways to reach parents of low-achieving children and to convince them that they should review their children's scores on a national test. It won't be easy.
From my own work with some of the nation's most talented classroom teachers, through the National Board for Professional Teaching Standards, I can tell you that teachers' reactions to standardized testing in any form range from indifference to revulsion. Most teachers consider the information provided by externally-imposed tests to be largely irrelevant in the context of their detailed, daily observations of the academic strengths, weaknesses, capabilities, and needs of the children they instruct.
Further, they regard the high-stakes tests used in local and State programs of accountability with fear and loathing. The best teachers regard such testing programs as unwarranted intrusions on their opportunities to function as independent professionals in selecting strategies for effective instruction and methods for evaluating their students' growth and development.
The impact of high-stakes testing programs on the content and methods of classroom instruction has been documented extensively in studies conducted by Mary Lee Smith and Lorrie Shepard. The picture painted by their findings is neither benign nor encouraging. They discovered endless days of mindless drill and practice on the form and format of standardized test items, with consequent loss of curricular depth, breadth, and innovation.
It will be difficult to report results in ways that convince teachers that the national tests are worthy of their attention. The proposed composition of the test -- four-fifths multiple choice items -- will only exacerbate this problem.
Test results must be communicated to parents and teachers in forms they can readily understand, despite rampant innumeracy and general ignorance or fear of statistical terminology and data summaries. Aschbacher and Herman indicated that many readers of test reports do not understand such basic terms as "average" and "norm".
The Gallup Phi Delta Kappa poll of the public's attitudes toward the public schools has been singularly successful in obtaining and communicating parents' evaluations of public schools on the traditional A through F scale. But most parents and many teachers cannot interpret test results in such widely-used scales as percentile ranks, grade equivalents, and stanines.
Hambleton and Slater found that policymakers holding advanced degrees had difficulty understanding and correctly interpreting the tabular summaries used to convey national assessment results. Parents, who are typically less well-educated and have far less frequent exposure to data summaries of any form than the Hambleton and Slater interviewees, therefore can be expected to have even greater difficulty.
One particularly disheartening finding from the Hambleton and Slater research was the difficulty their interviewees had interpreting graphical summaries. The "picture being worth a thousand words" adage might not hold when achievement test results are summarized, unless the picture is especially simple and straightforward.
Even if parents could be motivated to read about national tests and such results can be communicated in ways that are understandable, it is not necessarily the case that they will value the results as indicators of their children's achievement or of the quality of their children's school.
Findings from three studies bear on this issue. Shepard and Bliem interviewed 105 parents of 3rd Graders in a Colorado school system about the usefulness of different types of information for learning about their child's progress in school. They found that two-thirds of respondents rated standardized tests below the midpoint of a 5-point scale, with 1 meaning not at all useful and 5 meaning very useful.
Only 14 percent regarded standardized tests as very useful, in contrast to 77 percent who so regarded "my children's teacher talking about his or her progress", and 43 percent who so regarded their child's report cards. Jaeger and colleagues conducted detailed analyses of the content of over 500 school report cards produced by school systems throughout the nation.
Using protocols grounded in their content analyses researchers interviewed 166 parents of public school students in Greensboro, North Carolina, and Sacramento, California, to determine among other things, what parents most wanted to know about the condition and effectiveness of their children's schools.
When faced with paired choices among categories of information, parents in both cities agreed that "school environment information, information on the safety of the school, and the extent of involvement in the school by parents and other members of the community" was most important to their evaluation of the quality of their child's school.
Parents ranked school success information -- that is, information on the school's graduation rate, student promotion rates, number of A grades awarded, students' after-graduation plans, student special awards or honors earned, and students' athletic accomplishments -- as second most important to their evaluation.
It is of interest here that standardized testing information, defined as statistics that could tell you about the standardized test performances of all students in your child's entire grade or your child's entire school, was rated by parents as only third most important on the scale, just higher than a category labeled "student engagement information" -- which was information on the school's attendance rate, its dropout rate, and the number of students who had been suspended or expelled from the school.
These findings are consistent with those reported in the 23rd Gallup poll in the public's attitudes toward the public schools. In that survey parents were asked to read the importance of various factors in selecting the school for their child were school choice a possibility.
Quality of the teaching staff was rated very important by 85 percent of responding parents; followed by maintenance of school discipline by 76 percent; curriculum -- that is, the courses offered -- by 74 percent; size of classes by 57 percent; and grades or test scores of the student body by only 46 percent. One percent had a track record of graduates in high school, college, or on-the-job.
The bottom line here is that standardized test results are not regarded by many parents as important indicators of the quality of their child's school or of their child's progress in school. If the national tests are to stimulate the reforms proposed in the President's State of the Union Address, test results will have to be presented in ways that convince parents they're important and worthy of their attention and concern.
And finally, test results must be communicated in ways that sponsor valid interpretations and inferences. Again, it won't be easy. Murphy conducted research on the effect of reporting format on elementary school teachers' interpretations of standardized test results. He presented 671 teachers with score reports in both narrative and graphical tabular formats, followed by a series of interpretive statements with which they could agree or disagree on a 5-point scale.
Each statement represented an intentional overinterpretation of the data presented. Murphy included such statements as, "Compared with students nationwide this class is below average in Math concepts and above average in Math computation"; "This student has the Math solving skills of a 3rd Grader"; and "Compared with the nation's 5th Graders, this student is above average on the skills covered under language analysis".
Murphy found that sample teachers accepted gross overinterpretations of achievement test results regardless of the format used to report the test results and the number of courses and workshops on testing and measurement they'd completed. That last finding was particularly disheartening to me.
His conclusion is summarized in the following statement. "The overinterpretations concern concepts that are central to the field of testing: concepts of reliability, error, probability, and approximation. And if teachers cannot interpret such concepts ably, the most central concept of all, validity, becomes at issue. The inferences from test scores that were presented as part of this study were simply not valid."
In a chapter titled, "Five Common Misuses of Tests", first published in the 1982 NAS volume on ability testing, Eric Gardner cautioned against acceptance of the test title for what a test measures; ignoring error of measurement in test scores; using a single test score for decision-making; lack of understanding of test score reporting; and attributing cause of the behavior measured to the test that conveys the information.
Each of these cautions applies without modification to the planned voluntary tests in Grade 4 Reading and Grade 8 Mathematics. First, although correlations among Reading subscores are high, it is not the case that Reading is Reading is Reading, particularly when results are interpreted as an indication of what students know and can do, rather than how they fare in some relative sense.
As Gardner noted, "There is a tendency for unsophisticated users to accept the name assigned to a test as an accurate and complete description of the variable being measured".
Research by Shavelson and colleagues indicates that examinee-by-exercise interaction variance is a major contributor to individual differences among test scores, particularly when performance items are used. Student performances depend critically on the specific content of the test, not merely on the test framework, so performance information must be generalized with caution.
At a recent NAS conference on NAEP performance standards, Linn noted substantial differences between performance standards based on NAEP's dichotomously scored items and extended response items, and concluded that the proportion of students who would be classified as basic, proficient, or advanced is quite sensitive to the composition of the assessment.
He reported that 78 percent of 4th Graders would have been classified as performing in the basic category or above had the cut score been determined using only the dichotomously scored items on NAEP, but that only 3 percent of students would have been so classified had the cut score been based on extended response items.
This finding is particularly troublesome when viewed against the intention to link the national test to NAEP and to its achievement levels. Although I realize that the technical design of the national test is a work-in-progress, and that statements quoted out of context from transcripts of public meetings must not be regarded as definitive, the juxtaposition of Bob Linn's findings and Gary Philips's statements during the public meeting held on May 19th is disturbing, and I quote Gary.
"This is intended to be a test that specifically is focused on giving good information to the parents and teachers. The Reading and the Math will provide national standards and will do that through statistical linkage to NAEP, so we'll be able to provide basic, proficient, and advanced information on the test."
Linn's findings indicate that what parents are led to believe about their children's performances on the national test will depend substantially on the particular items that compose the test, and on the proportions of those items that are presented in dichotomously scored formats.
Whether Johnny Jones is a basic, proficient, or advanced 4th Grader -- 4th Grade Reader, and the percent of 4th Grade students who are classified as basic, proficient, or advanced readers in Johnny's school, district, or State, will be highly manipulable, and an artifact of the construction of the national tests. To the degree that the format composition of the national tests differs from that of NAEP, the NAEP achievement levels will not carry the same meaning for students who complete the national tests.
Public disclosure of the full test might help a bit here, but most parents cannot be expected to review the test and most will believe that it measures 4th Grade Reading or 8th Grade Mathematics regardless of its content and composition.
In keeping with the title of this session -- Potential Risks and Unintended Consequences of Testing -- the principal message conveyed by this paper is one of gloom and doom. I'd like to end on a more positive note.
The measurement literature contains several good papers on how test results should be organized and reported. The Aschbacher and Hermann report mentioned earlier -- although not grounded in new, empirical work -- draws heavily on related psychological literature and research in business and marketing.
Generalization of these findings to achievement test reports is somewhat an act of faith, but the recommendations they make are certainly sensible. Similarly, the suggestions made by Hamilton and Slater make sense, even though they haven't been validated with real consumers of test reports. They call for simplification and narrative explanation, combined with graphical display of results.
The same is true of recommendations provided by Howard Wainer in his lead article in the spring 1997 issue of The Journal of Educational and Behavioral Statistics. Wainer's recommendations are appealing and sensible. He illustrates a number of clever ways in which tabular and graphical data displays can be formulated so as to emphasize important results and eliminate the unimportant.
Even in the absence of validation it seems obvious that Wainer's recommendations must result in improved communication. Will these recommended reporting strategies effectively address the challenges described in this paper: motivation, understanding, valuing of results, and valid interpretation?
I cannot emphasize enough the importance of exploring this question through a sound program of research. If the national tests are to facilitate educational improvement as the President and the Secretary hope, the message must be understood and must be compelling.
To make it so, we must learn what parents and teachers will examine, what they infer, and how the packaging and presentation of test results can foster accurate and useful interpretation. Study of effective score reporting must be a major component of the program of research and evaluation envisioned for the national tests. Thank you.
CHAIRMAN EDLEY: Thank you very much, Richard. That's why he's the Excellence Foundation professor. Let me just -- we've just heard from the very eminent experts in this field of testing, and I was venting some anxiety before the session with Michael and Pattie that I wanted to make sure we included in the discussion some frank concerns about the program; that I didn't want unrelenting cheerleading.
I think the warning flags -- by my count, I think we're about up to 84 warning flags. But this is just to remind everybody that by the end of the show the goal of course, is to try to figure out strategies that the department might use to try to minimize these risks. So I think it's appropriate that we begin with a fairly exhaustive and comprehensive look at what those risks might be.
Let me just add that one thing that we have not been doing thus far, and I hope we'll get into it in the discussion, is trying to get from people some sense of the magnitude of these risks, and perhaps the relative importance of these risks beyond simply being exhaustive in our enumeration of them.
We're going to shift gears slightly now and hear from Janell Byrd who has really done us a great favor by, at the last minute, agreeing to come and talk informally, presenting not the perspective of an expert on testing, but rather, that perspective of several elements of the civil rights community.
Janell is an attorney with the Washington Office of the NAACP Legal Defense Fund where she's been for some years. I hope she won't mind my saying this, but she is one of the most accomplished and highly regarded Civil Rights litigators of her generation, and in particular for our purposes, has done a tremendous amount of litigation under Title VI of the Civil Rights Act, and is widely regarded as having done a state-of-the-art job of assembling expert witness testimony in some recent litigation.
So Janell, thanks very much for coming, to raise some questions from the Civil Rights perspective.
MS. BYRD: Thank you very much. And Chris correctly stated when he said I didn't have a -- my arm was somewhat twisted and I don't have a prepared text. But I did have a lot of questions before I arrived and I guess in part, because my arm was twisted, I don't really have to play by all the rules. So let me say this.
I understand the premise of the session is to come up with ways to minimize harm and to give advice and guidance to the Administration and the Department as to how to design and implement these tests. But having listened to these three prior presentations, I can't help but stand here and say, why the rush?
I mean, why are these questions not being answered in advance of the decision to move forward with these tests? It is obviously the first question: why are we doing this; is this the right thing; should we be moving forward?
It seems to me that we're saying, how do we correct this when we haven't decided that this is what we should be doing? I think that question has to be on the table, and I think that it is quite a serious problem to pre-empt it. So I encourage you strongly to take a step back and ask the first question, which the panelists, I think their presentations -- I mean, Chris said 84 alarms. I was thinking -- I was only counting the panelists -- I said three alarms. I said a 3-alarm fire. I said, my God, what are we doing?
From the Civil Rights perspective I would say that, one thing that comes to mind is the obvious concerns for minority children, for poor children, for having a national test. I mean, we're talking about communities which are often and increasingly, isolated from the majority white community; communities which have fewer resources and where the parents and the teachers and the schools are often under siege in ways -- from disadvantage and poverty -- which are not experienced in the broader community.
And we look also at the minority and poor students who are in majority institutions and ask how will those disadvantages that they face in this society be translated through the use of this exam? Now obviously the question is, what are the anticipated uses? And unfortunately, Gary Philips -- I mean, since he's from the government so he's in the hot seat, so everything's been directed at you, and to a certain extent I apologize -- but I guess you knew what was coming when you agreed to come.
But I think first and foremost, the question of how these tests will be used, and in your presentation you made a point that these tests will be used like any other tests. Well, will they be validated for any other purpose, other than just giving information to parents?
It is inconceivable to me that the test could be appropriate for use for tracking, for special education placement, for high school graduation, for promotion from grade or retention in grade, without being validated for that purpose.
And it would seem to me that the government would have the responsibility of making sure that if it is anticipated that these tests will be used for any other purpose -- and I think Eva Baker made it absolutely clear that the tests will be used for other purposes, if they're in the student's file, if the teacher has the test -- I do not believe, and I don't think many people in here would believe that the tests will not be used for high stakes purposes.
Being honest about that I think, requires us to say, one, you know, will it be valid for that purpose, and to the extent that it's not, what enforcement mechanism will there be; is it capable of being policed if there is an enforcement mechanism?
So for example, if you decide to recommend that we should say these tests should not be used for high stakes purposes, well what does that really mean? And is anybody prepared to -- is that even capable of being policed? I don't think that we can expect the Department of Education, Office for Civil Rights, the Department of Justice, to enforce that.
I think the track record in those institutions in enforcing these kinds of mechanisms is not a good one, quite frankly, and so I don't think -- I think in being honest about this, simply saying don't use it for this purpose, will be meaningless. And so it will be used for high stakes purposes, and there are no enforcement mechanisms in place to make sure that that doesn't happen.
What is the meaning of this information to parents if there's no other information about opportunity to learn? I mean, if the purpose is you think you're empowering parents, if there is no information about, you know, teacher/pupil ratios, resources, funding, the variety of things that might conceivably put this in some context which would allow parents and teachers to do something with the information, then it's really I think, pie-in-the-sky to expect that simply giving a test score is going to have any different impact than giving grades at the end of the year that might be poor grades.
I mean, is this going to be a punitive measure? Is this just another way in which the kids and the communities and the schools which are most disadvantaged, will be blamed and said, they are the problem, they are at fault?
Further, as we talk about measuring students against a national standard but it's a voluntary test, obviously the question that presents itself is, who's most likely to opt out of this? From some of the things I've been hearing, the States that are most likely to opt out are some of those States -- some of the deep Southern States, some States with high minority populations, and query whether we then have any kind of national standard if that is indeed, at least one of the purposes of this exam.
That's just the beginning of the questions. I mean, it's obvious from the audience here, the wealth of knowledge that you have, that you all probably had all these questions to begin with and more.
I'll just say that my main point here is that this is a frightening proposition, it is being rushed into by the Administration I think, without adequate forethought, and there is reason for all of us to be concerned, and there is reason for pause, research, reflection before we move forward. Thank you.
CHAIRMAN EDLEY: Thank you very much, Janell. We now turn to two marvelous discussants, and let me start with -- Kati Haycock? Are you the first up? No, Constance Newman, who's the Under Secretary at the Smithsonian Institution and formerly directed the Office of Personnel Management. And we've asked her and Kati to provide a commentary before we open it up for general discussion.
MS. NEWMAN: Thank you. I thought you would mention the thing that I'm most proud of. I was in the original BOTA group, and it's a pleasure for me to be here and to participate in this conference in Regulatory -- sorry Gary, but the title did say Regulatory -- and Licensing Issues associated with the tests.
At the outset, I'd like to say that nothing I say should be taken as an indication of my opposition to the Voluntary National Test. To the contrary, I believe that these tests represent the hope that all children everywhere will begin to master the basics.
I've been involved in the last two years in the District of Columbia on the Control Board and recognize, if you took the District of Columbia alone, testing could or should mean that in the District there would be a reversal of the trends in recent years. Over the past five years, the erosion in the District's public schools has accelerated for thousands of children, particularly those in the poorest Wards.
On the Comprehensive Tests of Basic Skills, the Math scores have declined by 6 percent and Reading by 10 percent. And on NAEP, the trial assessment, 78 percent of the 4th Grade students scored below the basic reading level. So I believe -- I understand the reservations here -- but I have hope that the tests will give parents and teachers and leaders in the school administrations an opportunity to measure progress or the lack thereof, against national standards.
So what do I have to say since I've said that? I do have some concerns and some, I will maybe call them observations, in three categories: observations with regard to the impact of the test on parents and children; secondly on teachers and school administrators; and finally, the impact of the test reporting to elected officials, funding sources, and the public.
With regard to the impact on parents and children, everyone in the testing program, developing the policies, should be concerned about the challenges presented this morning by Richard Jaegar. It's important -- it's clearly important that parents understand what the results mean. They must not become so confused by the results that they are unreasonable with their children and unfair to the teachers.
And what I mean by saying unreasonable with their children is that if they overreact to what may be viewed as low scores, or don't react at all, and thus don't provide the reinforcement of the teacher's effort, we, I think, would all have failed in what it is we're trying to accomplish. They can be unreasonable with regard to relating to the teachers by being unrealistic about the speed with which change in test performance can take place.
We were involved in a major change in the District of Columbia in the Administration of the school system. They've been in place about eight months and we're getting beat up on a regular basis because the students' test scores haven't improved; that they aren't reading at a higher level.
And I am concerned that unless there's clear discussion of expectations going out with the test strategy, there's going to be a great deal of unfair pressure on the teachers and the school administrators, which is not going to improve the relationships that parents have with teachers.
With regard to the impact on teachers in schools, the requirement that students with disabilities and limited English proficient students be included is not only fair but it's an honest way for schools to really understand the performance level of all the children and to act accordingly.
Hopefully, hopefully, the teachers will not view this whole process as one of competition, and thereby react in a negative way to having to include all of the students, on the view that doing so brings down the scores and does not allow them to compete with whomever they think they're competing with.
We should all be concerned about the findings of Mary Lee Smith and Laurie Shepherd, that it will be difficult to report results in ways that convince teachers that national tests are worthy of their attention. My question is, how are we going to get the buy-in of the teachers?
And if there's not buy-in -- and this was pointed out earlier today -- if there's not buy-in we'll have all these scores and nothing will have changed in the classroom with regard to method and content. And so the purpose, what I believe is the ultimate purpose, will not have been met.
And finally, with regard to the impact of reporting to elected officials, to funding sources, and to the general public, I heard Gary today and have read also, the point that no data from individual students will go to the Department of Education.
So this doesn't relate -- my concern does not relate to the individual data going to the Federal Government, but it does relate to concern about the reporting through media to the public, particularly about aggregated test results at any level -- and I'm marrying some things that have been said earlier -- because it can cause incorrect conclusions to be drawn about groups -- that's minorities, the disabled, and people with limited English.
I've always been very sensitive on this point and some people say, somewhat unreasonable, but I believe this country is already divided too much. There are too many assumptions about people by groups, and a constant barrage of information that says, African-American children are performing below the national average by X percent in every venue of testing. Even though it is true, if it is not described in the proper context it is going to give heart to those who want to believe and say that African-Americans are inferior.
Now I know all the academics and everybody here is going to say, you know, we have to be honest with the data. I'm only saying that when we report the aggregated data we have to be very careful about how it is used and who uses it, and what words are used, because it could be used as ammunition to further divide the nation.
And I will just say that I do believe what mitigates this concern are some things that Gary said. If it is true that in this process, guidelines are going to be developed and the public is going to be able to participate in the guidelines, and others who are concerned about this and know how best to communicate this will influence what goes out, then maybe I have no reason to be as nervous as I am.
And I will just close by saying that I picked up a few action items that I hope you have on your notes, Gary. One being that there should be extensive work done to remove the incentives for misuse of these tests. There is a need to grapple with the fact that there are students whose characteristics are not being met in the initial design; for example, students with language other than English and Spanish.
You need to be sure to spend time on communication strategies, and -- this I'm just repeating -- particularly communication strategies with parents. And the first question that Janell mentioned, I think you do have a purpose. I hope you do. I thought I heard a purpose, but what I pick up from what was said today is, there is need to have a clear statement of that purpose that is broadly communicated, and a clear statement and understanding by all, about the uses.
So with that I will end. I do echo another statement made by Janell that in getting this information out, it has to be in the proper context, and we all have to be working toward ensuring that once the information is there, somebody does something to fix whatever is broken.
CHAIRMAN EDLEY: Thanks very much, Connie, for all of those thoughts. Kati Haycock is one of the nation's leading child advocates in the field of Education, and she was formerly the Executive Vice President of the Children's Defense Fund, the nation's largest child advocacy organization, and is currently the Director of the Education Trust which was established in 1992. And the focus of its activities concerning children are really the interest and concerns of poor and minority kids. Kati.
MS. HAYCOCK: Thanks, Chris. As Chris indicated, I head up an organization called the Education Trust, whose sole purpose is to improve the education provided to minority and poor children in this country and in so doing, to close the gap between groups.
I want to take a couple of minutes today and explain to you why someone like me is persuaded that done right -- and I want to emphasize "that done right" -- a national test can be an important tool in the larger effort to accomplish that goal. I want to do that really, by taking us away from the technical language and talk with you a little bit about kids in classrooms.
When you spend as much time as I do, along with my staff, in classrooms, you can't help but be overwhelmed sometimes by the enormous inequities, many of which Janell described: the differences in facilities, the huge differences in instructional equipment like computers and laboratory stuff, and the large differences in the training of those who teach minority youngsters and those who teach other youngsters.
But at least in our judgment, none of the inequities are more damaging to the achievement levels of poor, minority youngsters, than the low level curriculum and low expectations that guide their education.
In middle grades classrooms in inner cities, we typically see more coloring assignments than Writing or Mathematics assignments. And I'm not kidding about that. I can take you to countless school districts where youngsters are asked to draw complicated borders on the outsides of their Mathematics homework and are graded as much for staying in the lines in their coloring as they are for the quality of their Mathematics.
I can take you to urban high schools where there's a lot of coloring that goes on too, and where English teachers think it's a criminal act to assign more, for example, than a 3-paragraph essay to 11th grade kids.
In fact, my staff came back not too long ago from Philadelphia with an assignment that was given by a Philadelphia teacher to her largely poor and minority student population. This was the assignment. The assignment was to choose an historical figure who interests you, do some research -- and then here's what you do.
You find a picture of that person, you xerox it, you glue it on the center of a posterboard, and then around the picture you illustrate, decorate the poster with colors and glitter and paint. And then on a 3X5 card in each of the four corners of that poster, write a sentence or two about what you learned.
Now if I asked you to tell me what age level kids that would be an appropriate assignment for, most of you would probably say about 4th Grade. This was an 11th Grade classroom though, and the kids had one month to do that assignment. Moreover, what they got the other months looked quite a bit the same.
This problem may be even worse in LEP classrooms where teachers routinely conclude that because youngsters lack English language skills that they also lack cognitive ability. So their science content is about building dioramas and coloring pictures of fish and hanging them in a diorama. That's the sort of sum total of the content of their science.
Basically, what we do in American schools is we take kids who have less to begin with and we teach them less in school, too. Now, these practices continue, at least in part, because they're very much hidden from public view. They're hidden from parents by the A's that their kids bring home for work that would earn a C or a D in the suburbs. They're hidden from communities by reports that their students are achieving at the 45th percentile or the third stanine, whatever that means.
They're