Testcraft: A Teacher's Guide to Writing and Using Language Test Specifications

eBook

$34.49 (original price $40.00; save 14%)

Overview

The creation of language tests is—and should be—a craft that is accessible and doable not only by a few language test experts, but also by many others who are involved in second/foreign language education, say the authors of this clear and timely book. Fred Davidson and Brian Lynch offer language educators a how-to guide for creating tests that reliably measure exactly what they are intended to measure. Classroom teachers, language administrators, and professors of language testing courses will find in this book an easy and flexible approach to language testing as well as the tools they need to develop tests appropriate to their individual needs.

Davidson and Lynch explain criterion-related language test development, a process that focuses on the early stages of test development when the criterion to be tested is defined, specifications are established, and items and tasks are written. This process helps clarify the description of what is being measured by a test and enables teachers to give input on test design in any instructional setting. Informed by extensive research in criterion-referenced measurement, this book invites all language educators to participate in the craft of test development and shows them how to go about it.

Product Details

ISBN-13: 9780300133813
Publisher: Yale University Press
Publication date: 10/01/2008
Sold by: Barnes & Noble
Format: eBook
File size: 3 MB

About the Author

Fred Davidson is associate professor of English as an International Language at the University of Illinois, Urbana-Champaign. Brian K. Lynch is associate professor in applied linguistics at Portland State University.

Read an Excerpt



Chapter One


The Nature of Testcraft


Unity in Diversity

Testcraft is a book about language test development using test specifications, which are generative blueprints for test design. Our book is intended for language teachers at all career levels, from those in degree or training programs to those who are working in language education settings. We assume no formal training in language testing, no training in educational or psychological measurement, and (certainly!) no training in statistics.

    We wish to begin this book with a fundamental premise. Language educators—the readers of this book—are a profoundly diverse group of people. The variety and scope of language education are amazing and heartwarming. It reflects what is best about being a teacher: that dazzling reality of being cast loose to help guide students through the challenge of learning. We face and surmount the challenge of our jobs in many ways. We do the best we can with what we have available. We are a profession united in our diversity, and we are comfortable with the philosophical reach displayed in daily practice. There is no single best model of language teaching. There is no single best teaching approach. So too there is no single best language test method.

    That said, is there some unifying principle or common theme across this diversity? We believe that the first, best theme uniting us all is a desire to help our students learn. Tests play a central role in that dynamic. Recent scholarship and philosophy of testing have emphasized the consequences of tests: a test is not a thing in and of itself; it is a thing defined by its impact on the people who use it. This has become known as "washback" or "backwash" in the language testing literature. Does the test foster educational growth? Do students suffer or benefit from it? Do educational systems suffer or benefit from it?

    Our approach to test development is intended to be inclusive, open, and reflective. We promote tests that are crafted by a group of invested individuals, and we especially promote the inclusion in that group of individuals not normally invited to test development discussions. We hope that our approach enhances the impact of testing in a positive way; we hope that testcraft enhances positive washback. More to the point, we believe that the only kind of washback worthy of our time and energy is positive washback, so we hope that testcraft is an act of washback.

    To achieve positive washback, a test must be tuned to its setting. It must reflect the desires and beliefs of the educators in its context and the resources available to them. These features vary widely. One setting may be deeply committed to some new language teaching methodology, even to the extent that staff have written uniquely designed, and strictly adhered to, instructional materials. Another setting may have a firmly egalitarian mindset: teachers receive some guidance and coordinate with each other in a loose manner, but once inside the classroom, each teacher is his or her own boss. Other settings may be in the creative chaos of change, as some new instructional philosophy takes hold. We hope this book speaks to all three settings and to many more.

    When it comes time to develop a test, simple recipes will not work. It is not possible for us to dictate to you: this is how to test intermediate listening comprehension; this is how to assess advanced ability in writing; this is how to measure beginning grammar. Some particular test method proposed for some particular skill may fit one or two settings very well once or twice. It may fit other settings less well but more frequently. But it will fit all settings very rarely—or more likely never.

    We emphasize that you should write your own recipes rather than follow test recipes given to you. More accurately, we provide a demonstration; we give you a wide array of sample recipes while you acquire the ability to render your own belief systems in recipe form. You may find that one or two of the test techniques discussed in this book fit your needs at some point (and we would be pleased if that were the case), but we are not presenting our recipes as finished or recommended test products. Instead, what is important is to illustrate the process: How do recipes get crafted? We hope our discussion of the process of developing these recipes will almost always seem relevant, because it is that process we wish you to take away. This requires learning the basic tool of testcraft—the basic recipe format of test development—a test specification.


The Basic Tool of Testcraft

Testing is like a trade guild. You have to train long and hard to learn to be a carpenter or chef or a language tester. One accepted end to the training is the Ph.D. in language assessment followed by (or in some training systems, preceded by) a lengthy internship during which veteran testers and the seemingly incontestable evidence of empirical trial continuously judge your work. From time to time, there are visitors to the various sites of guild activity such as conferences, workshops, or professional journals. These visitors may present papers, publish an article, or participate in a test development project. Gradually, over time, the visitors may transition from newcomers to apprentices to masters within the guild. Alternatively, they may never really join the guild; they may decide that language testing is not for them. A great tragedy is that they often take away with them energy and knowledge and creativity that the guild needs in order to continue to thrive and to grow.

    We see in the existing language testing textbooks an absence of an inclusive model of test development. They either concentrate on nonstatistical test-building advice that still assumes familiarity with statistical foundations (or makes that familiarity a requisite step in the process), or they concentrate on statistical models that are beyond the reach of potential newcomers. These texts are excellent, make no mistake, but they are written primarily for language testers; that is, they are "guild-internal."

    We want to open the guild up to newcomers by revisiting and revising the rules of entry. We want to provide the best of our experience and (we believe) the most important bits and pieces of standard guild practice. That is, we do not abrogate good practice, and much of what we advocate in this book is standard everyday activity for test developers around the world. In order to open it up, we wish to redefine the activity of the "guild" to be a "craft"—testing is and should be accessible to and executable by a large number of people. By analogy, you can prepare excellent food yourself without formal training and admission to the chef's guild. Likewise, a basement carpenter can use the same tools as a master and achieve perfectly acceptable results, and home plumbing repair is within reach of the patient homeowner. Language testing should be seen in the same light as these crafts. You can get very good results if you use a few of the most crucial tools.

    The chief tool of language test development is a test specification, which is a generative blueprint from which test items or tasks can be produced. A well-written test specification (or "spec") can generate many equivalent test tasks. Our review of the historical literature on test specifications reveals a serious lack of attention to how specs come to be well-written; it is that particular part of our craft which we wish to outline in this book.
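
    By way of illustration (ours, not the authors'), the five spec components named in this book's table of contents (the General Description, Prompt Attributes, Response Attributes, Sample Item, and Specification Supplement of the Popham model) can be sketched as a simple data structure. This is a minimal sketch with invented placeholder content, not a spec drawn from the book:

```python
# A minimal, hypothetical sketch of a test specification, loosely following
# the five-component Popham model named in this book's table of contents.
# All field contents below are invented placeholders, not examples from the book.
from dataclasses import dataclass

@dataclass
class TestSpec:
    general_description: str       # GD: the skill or knowledge being measured
    prompt_attributes: str         # PA: what the examinee is shown or asked
    response_attributes: str       # RA: what the examinee does and how it is scored
    sample_item: str               # SI: one illustrative item the spec can generate
    specification_supplement: str  # SS: any extra detail item writers need

# One spec can generate many equivalent items; item writers work from these fields.
reading_spec = TestSpec(
    general_description="Examinee reads a short notice and identifies its main purpose.",
    prompt_attributes="A 50-80 word public notice followed by one question.",
    response_attributes="Four options, one key; distractors drawn from details in the notice.",
    sample_item="Why was this notice posted? (A) ... (B) ... (C) ... (D) ...",
    specification_supplement="Notices concern everyday campus life: housing, events, deadlines.",
)
print(reading_spec.general_description)
```

    In the book itself a spec is a prose document, not code; the structure above only illustrates the underlying idea that a spec is a reusable, fielded recipe rather than a single item.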

    There may be many ways to cook a particular dish. We want to help you write recipes. What you decide to cook is up to you: lasagna, stir-fried vegetables, or a chocolate cake. How you decide to cook it is also up to you: what spices you use, what quality of ingredients—these and many other choices should reflect the unique needs of your setting, and not our preconceived notions about what is or what is not "right." Our job is to help you acquire tools to write the recipe, regardless of what you want to eat.

    We intend this book to be an accessible, energetic, readable, nonthreatening text, but one still very much in tune with the existing level of craft knowledge within the language testing guild. If you exit the book with the skills to write your own recipes (your own specifications), then you have taken a major step toward test reform. We will return to this important point in Chapter 7, but at this stage we would make the following simple observation: test specs can become more than generative test development engines. They can become focal points for critical dialogue and change in an educational system. In fact, modern scholarly attention to specs and our own first exposure to them came from a historical reformist debate in educational and psychological measurement, to which we now turn.


A Bit of History

A test specification is not a new concept. Probably derived from the industrial concept of a "specification" for a factory product or engineering objective, the earliest mention we have located in educational and psychological assessment was by Ruch (1929; Gopalan and Davidson 2000). Ruch noted that the term was "adopted," presumably from another source, such as industry. The meaning of the term was then as it is now: to provide an efficient generative blueprint by which many similar instances of the same assessment task can be generated.

    Ruch acknowledged that specifications could greatly assist in the creation of what was called "objective" testing. An objective test avoids the putative subjectivity of expert-rated tasks. Each task is scorable against an answer key; the multiple-choice item type is the most familiar version of objective testing, and true-false and matching items are also considered objective. In order to achieve consistent objective testing, it is necessary to control the production of large amounts of similar items. Hence, test specifications become crucial.

    Objective testing was the early name for what we would today call psychometric norm-referenced measurement (NRM). The goal of such testing is to control the distribution of examinee results on the total score. Well-constructed objective norm-referenced tests (NRTs) consistently and accurately yield examinee distributions in a familiar bell-curve shape. The meaning of a result then becomes the position of a particular student on that curve: What percentage of examinees scored below that particular student? Technology has evolved to ensure this distributional shape, and a vast number of large-scale modern tests have resulted.

    For the past four decades, an alternative paradigm has existed, and our own exposure to test specifications came from our training within this paradigm. Criterion-referenced measurement (CRM) has been a topic of debate and research in educational measurement. Discussions of this topic have occurred under various labels: criterion-referenced measurement, domain-referenced measurement, mastery testing, and minimum competency testing.

    Historically, CRM has been defined in opposition to NRM. The distinction was first made in an article by Glaser and Klaus in 1962; however, it was the essay published by Glaser in the following year that is most often cited (Glaser 1963). This paper was only three pages in length, but it generated a paradigm in educational measurement that is still active and relevant today. Glaser defined two types of information that can be obtained from achievement test scores. The first is associated with CRM: "the degree to which the student has attained criterion performance, for example, whether he can satisfactorily prepare an experimental report" (Glaser 1963/1994, p. 6). The second type of information is associated with NRM: "the relative ordering of individuals with respect to their test performance, for example, whether Student A can solve his problems more quickly than Student B" (Glaser 1963/1994, p. 6). The new direction in testing generated by Glaser's article was characterized by the following: "a student's score on a criterion-referenced measure provides explicit information as to what the individual can and cannot do. Criterion-referenced measures indicate the content of the behavioral repertoire, and the correspondence between what an individual does and the underlying continuum of achievement. Measures which assess student achievement in terms of a criterion standard thus provide information as to the degree of competence attained by a particular student which is independent of reference to the performance of others" (Glaser 1963/1994, p. 6).

    The promise of Glaser's call for CRM was first established by Popham and Husek (1969), who detailed the advantages of CRM over NRM in the context of individualized instruction—that CRM would provide the level of detail needed to monitor student progress, would allow for the assessment of student performance in relation to instructional objectives, and would therefore also be useful in program evaluation.

    This was followed by the work of researchers such as Hambleton and Novick (1973), who focused on the measurement problems associated with CRM. They looked at the special requirements for constructing CRTs, including how to establish mastery levels, or cut scores. This concern with cut scores is often mistaken for the defining characteristic of CRTs.

    Hively and his associates (Hively et al. 1973) developed procedures for the specification of criteria and the sampling of items to represent those criteria. For example, they presented minutely defined specifications, or "formalized item forms," for test items designed to assess criteria such as the ability to compare the weights of two objects using mathematical symbols. This work has been referred to as domain-referenced measurement (DRM), and there is some disagreement in the literature as to whether this is the same as or different from CRM. Popham (1978) seemed to argue that they refer to the same general approach, and chose CRM for the pragmatic reason that most of the literature uses this label. However, others such as Linn (1994) have argued that DRM, in its pure form, led to overly restricted specification of what was to be assessed, and that CRM is distinguished by its ability to avoid such overspecification.

    Scholarship in applied linguistics also took notice of the distinction between CRM and NRM. Cartier (1968) is possibly the earliest mention of CRM in language testing circles, and Ingram (1977) depicted CRM as focused on a "pattern of success or failure" and the construction of "homogeneous blocks of items ... to test mastery of a particular teaching objective" (Ingram 1977, p. 28). Following that early work, however, there was little or no discussion of CRM until Cziko (1982), followed by Hudson and Lynch (1984). Since then, there has been a surge in interest in the principles of CRM within the language testing community (Bachman 1989, 1990; Brown 1989, 1990; Cook 1992; Davidson and Lynch 1993; Hudson 1989, 1991; Hughes 1989; Lynch and Davidson 1994).

    The fundamental contribution and relevance of CRM was underscored by the 1993 annual meeting of the American Educational Research Association, which presented a symposium on CRM commemorating the thirtieth anniversary of the influential Glaser article. The symposium set the tone for more recent work in this area, bringing together the luminaries of CRM (at least the North American variety)—Ronald Hambleton, Robert Linn, Jason Millman, James Popham, and the originator himself, Robert Glaser. Papers from the meeting were published as a special edition of Educational Measurement: Issues and Practice in 1994 (volume 13, number 4). This publication reminds us that the unique contributions of CRM have been a focus on test specifications and the clear referencing of test scores to content, and that these characteristics make CRM particularly relevant for the present emphasis on performance testing and authentic assessment. These emphases are also precisely why we have included this lengthy discussion of the history of CRM.

    The historical development of CRM, then, has been realized for the most part in opposition to NRM. Its history has tended to focus much of its research on comparative statistical procedures for the analysis of test items traditionally used in NRM. This is unfortunate, because such comparative research has diverted attention from the real contribution made by CRM—clarity in test method, content, and construct.

    We began our scholarly training as testers rooted in the CRM/NRM distinction. We believed firmly that the CRM approach to test development was far superior to that employed in NRM for many testing purposes (especially achievement testing). As we wrote and presented talks on this topic, and as we employed specs in our work, and as we ran workshops on test specifications, we came to realize that the NRM/CRM distinction is not as necessary as we once thought. Good tests involve clear thinking, and regardless of the use of the test score, certain fundamental practices always seem to apply.

    To capture these fundamentals, there is a phrase we enjoy: "iterative, consensus-based, specification-driven testing." We advocate tests that are developed in an iterative manner: there are cycles of feedback-laden improvement over time as the test grows and evolves. We advocate tests that are consensus-based: the test should result from dialogue and debate among a group of educators, and it should not result from a top-down dictate, at least not without negotiation from the bottom-up as well. And finally, we advocate tests that are specification-driven: a specification is an efficient generative recipe for a test that fosters dialogue and discovery at a higher, more abstract level than achieved by analysis of a simple item or task.

    Whether or not the test is CRM or NRM is a different matter. The CRM/NRM distinction concerns mainly the use that is made of the test result: Is it used to say something about mastery of a particular set of skills, or is it used to rank examinees, or is it used to do both? To us, a test of any purpose should be developed in an iterative and consensus-based form, and such development can be achieved through specifications.

(Continues...)


Excerpted from Testcraft by Fred Davidson, Brian K. Lynch. Copyright © 2002 by Yale University. Excerpted by permission.

Table of Contents

Acknowledgments
Chapter 1: The Nature of Testcraft
    Unity in Diversity
    The Basic Tool of Testcraft
    A Bit of History
    Clarity and Validity
    Purpose
    Teaching
    Specificity of the Criterion
    The Five-Component Popham Test Specification Model
    Wrapping Up
    Exercises
    Outline of the Remaining Chapters
Chapter 2: The Components of Test Specifications
    Specification Format and Purpose
    Specification Components
    The General Description
    The Prompt Attributes Section
    The Response Attributes Section
    The Sample Item
    The Specification Supplement
    All Together Now
    Alternative Approaches to Specification Format
    Exercises
Chapter 3: Problems and Issues in Specification Writing
    Problem 1: The Role of the GD
    Problem 2: The Difference between the PA and the RA
    Problem 3: Reverse Engineering
    Problem 4: Item/Task Fit-to-Spec
    Problem 5: The Event versus the Procedure
    Problem 6: Specplates
    Problem 7: "Speclish" and Level of Generality
    Problem 8: Ownership
    Concluding Remarks about the Process
Chapter 4: Building the Test
    Unity, Chance, and Control in Test Design
    The Nature of Feedback
    The Set-in-Stone Phenomenon
    Aggregation and the Table of Specifications
    The Importance of Response Data
    Banks and Maps
    "Finalizing" the Operational Measure and the Go/No-Go Decision
    Concluding Remarks
Chapter 5: The Mandate
    Story 1: The Reactive EFL School
    Story 2: The IMAGE
    Story 3: The UCLA ESLPE
    Story 4: Lowering Admissions Scores
    Story 5: "It's Time to Write the Test Again"
    Story 6: Shorten the Test
    Conclusions
Chapter 6: The Team
    The Power of Groups
    The Structure of a Group
    The Process of a Group
    The Testcraft Group Dynamics Study
    Participants
    Data Gathering and Analysis
    Findings from the Testcraft Group Dynamics Study
    Conclusions from the Testcraft Group Study
    Exercises on Group Process and Testcrafting
Chapter 7: The Agency of Testcraft
    Review and Preview
    Determinism in Testing
    Advocacy in Testing
    Advocacy by Whom and How?
    Advocacy for Whom?
    Thought Exercises
Appendix 1: Some Essentials of Language Testing
    The Classical Purposes of Testing
    Operationalization and the Discrete-Integrative Distinction
    Statistics
    Reliability and Validity
Appendix 2: Teaching Specification Writing
Bibliography
Index