Evaluation Strategies
for Human Services
Programs


A Guide for Policymakers
and Providers


Adele Harrell
with
Martha Burt
Harry Hatry
Shelli Rossman
Jeffrey Roth
William Sabol

The Urban Institute, Washington, D.C.


Contents
Clarifying the Evaluation Questions
Developing a Logic Model
Assessing Readiness for Evaluation
Selecting an Evaluation Design
Identifying Potential Evaluation Problems
Conclusions

EXHIBITS
Exhibit A: Logic Model Used in Evaluation of the Children At Risk Program
Exhibit B: Process for Selecting Impact Evaluation Designs

Evaluation Strategies for
Human Services Programs

A Guide for Policymakers and Providers

In the continuing effort to improve human service programs, funders, policymakers, and service providers are increasingly recognizing the importance of rigorous program evaluations. They want to know what the programs accomplish, what they cost, and how they should be operated to achieve maximum cost-effectiveness. They want to know which programs work for which groups, and they want conclusions based on evidence, rather than testimonials and impassioned pleas.

This paper lays out, for the nontechnician, the basic principles of program evaluation design. It signals common pitfalls, identifies constraints that need to be considered, and presents ideas for solving potential problems. These principles are general and can be applied to a wide range of human service programs. We illustrate these principles here with examples from programs for vulnerable children and youth. Evaluation of these programs is particularly challenging because they address a wide diversity of problems and possible solutions, often include multiple agencies and clients, and change over time to meet shifting service needs.

Steps in Selecting the Appropriate Evaluation Design. The first step in the process of selecting an evaluation design is to clarify the questions that need to be answered. The next step is to develop a logic model that lays out the expected causal linkages between the program (or program components) and the program goals. Without tracing these anticipated links it is impossible to interpret the evaluation evidence that is collected. The third step is to review the program to assess its readiness for evaluation. These three steps can be done at the same time or in overlapping stages. For expositional clarity we will discuss each of them in turn. We will then describe how to select the best design for a given purpose from among the major types of evaluation that exist.

Clarifying the Evaluation Questions
The design of any evaluation begins by defining the audience for the evaluation findings, what they need to know, and when. These questions determine which of the following four major types of evaluation should be chosen:

Impact evaluations focus on questions of causality. Did the program have its intended effects? If so, who was helped and what activities or characteristics of the program created the impact? Did the program have any unintended consequences, positive or negative?

Performance monitoring provides information on key aspects of how a system or program is operating and the extent to which specified program objectives are being attained (e.g., numbers of youth served compared to target goals, reductions in school dropouts compared to target goals). Results are used by service providers, funders, and policymakers to assess the program's performance and accomplishments.

Process evaluations answer questions about how the program operates and document the procedures and activities undertaken in service delivery. Such evaluations help identify problems faced in delivering services and strategies for overcoming these problems. They are useful to practitioners and service providers in replicating or adapting program strategies.

Cost evaluations address how much the program or program components cost, preferably in relation to alternative uses of the same resources and to the benefits being produced by the program. In the current fiscal environment, programs must expect to defend their costs against alternative uses.

A comprehensive evaluation will include all these activities. Sometimes, however, the questions raised, the target audience for findings, or the available resources limit the evaluation focus to one or two of these activities.

Whether to provide preliminary evaluation findings to staff for use in improving program operations and developing additional services is an issue that needs to be faced. Preliminary results can be used effectively to identify operational problems and to develop the capacity of program staff to conduct their own ongoing evaluation and monitoring activities.(1) But this use of evaluation findings, called formative evaluation, presents a challenge to evaluators, who face the much more difficult task of estimating the impact of an evolving intervention. When the program itself is continuing to change, measuring impact requires ongoing measurement of the types and level of service provided. The danger in formative evaluations is that the line between program operations and assessment will be blurred. The extra effort and resources required for impact analysis in formative evaluations have to be weighed against the potential gains to the program from ongoing improvements and the greater usefulness of the final evaluation findings.

Developing a Logic Model
It is impossible to interpret evaluation findings without a clear understanding of program goals, implementation sequences, and the expected links between them and expected program benefits. Expectations about these linkages are made explicit by developing a logic model. Such a model is developed by discussing with service providers and funders the goals of and rationales behind program organization and content, examining planning documents and program reports, and reviewing research findings on similar programs or problems. The literature review may be particularly helpful in identifying plausible causal links and any factors other than the program which should be considered in the evaluation.

The logic model provides a simplified description of the program, the intended outputs, and the intended outcomes. Program characteristics include the population to be reached, the resources to be used, and the types and levels of services to be delivered. Outputs are immediate program products resulting from the internal operations of the program, such as the delivery of planned services. Examples of output indicators in the area of programs for vulnerable children and youth might include the numbers of children immunized, home visits by case managers, or youth completing a job training program. These program outputs are, in turn, the vehicle for producing the desired program outcomes, for example, decreases in childhood illnesses, decreases in abuse and neglect cases, or increases in youth employment. Careful attention must be paid to when the anticipated outcome should be expected to occur. For this reason it is often useful to divide outcomes into intermediate versus longer term. For example, improved school attendance in early grades might be an intermediate outcome associated with the longer-term outcome of dropout prevention. Care must be taken to focus on outcomes that will occur within the study period.

A classic failure in selecting an outcome that is expected to occur within the time frame of the study occurred in evaluations of the DARE drug prevention program, an educational program for fifth and sixth graders designed to prevent drug use. Evaluation results showed no significant prevention of drug use at the end of the program. This result should have been anticipated, since drug use does not typically begin among youth in this country until the mid-teen years (14 to 17). An age-appropriate intermediate outcome should have been selected as the primary outcome measure, such as improved peer resistance skills and changes in beliefs about the risks of drug use.

The logic model should also include explicit mapping of the conditions present in the program environment or characteristics of the target group or community that may affect the program's ability to achieve its goals. Non-program characteristics of the program organization, community or target population that are likely to influence the outputs and outcomes and/or use of program services are called antecedent variables. Conditions or events in the program, target population, or community that may limit or expand the extent to which program outputs actually produce the desired outcomes are called mediating variables. For example, a drug abuse prevention program may be less effective if the program staff are inexperienced, or if the local community offers fewer recreational alternatives to substance abuse and/or more active open drug markets (antecedent variables). Offering other support services in combination with the program may enhance its impact (a mediating variable).
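For programs that keep planning information electronically, the elements of a logic model can also be recorded in a simple structured form that evaluators and staff can review together. The short sketch below, written in the Python programming language, is purely illustrative; the component, output, outcome, and variable names are hypothetical examples drawn from the discussion above, not a required format.

    # Illustrative sketch: a logic model captured as a simple data structure.
    # All names below are hypothetical examples, not a prescribed schema.
    logic_model = {
        "target_population": "youth ages 13-15 in distressed neighborhoods",
        "components": ["case management", "parenting classes", "tutoring"],
        "outputs": ["home visits by case managers", "tutoring sessions delivered"],
        "intermediate_outcomes": ["improved school attendance", "stronger peer resistance skills"],
        "longer_term_outcomes": ["reduced drug use", "dropout prevention"],
        "antecedent_variables": ["staff experience", "neighborhood drug market activity"],
        "mediating_variables": ["other support services received"],
    }

    # Expected causal links, read left to right as in a logic model diagram.
    links = [
        ("parenting classes", "improved parental homework assistance"),
        ("improved parental homework assistance", "improved school attendance"),
        ("improved school attendance", "dropout prevention"),
    ]

    for cause, effect in links:
        print(f"{cause} --> {effect}")

Writing the model down in this way forces the same discipline as the diagram: every output and outcome must be traceable back to a program component.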

In impact evaluations the logic model is used to spell out how, and for whom, certain services are expected to create specific changes/benefits. For example, if the program includes parenting classes, the logic model will identify this activity as a key program component and show the types of changes in parenting that will be used to measure program outcomes (e.g., by improving parental assistance with homework or helping parents communicate more effectively with adolescents).

In performance monitoring, the logic model is used to focus on which kinds of output and outcome indicators are appropriate for specific target populations, communities, or time periods. For example, among indicators of child improvement in school, one might expect attendance to improve in the first semester of a program, but academic test score improvement only after a significant period of program participation-with the timing possibly varying by the age and developmental stage of the children.

In process evaluation, the logic model is used to identify expectations about how the program should work-an "ideal type"-which can then be used to assess the deviations in practice, why these deviations have occurred, and how the deviations may affect program outputs. This assists program managers (and evaluators) in identifying differences (including positive and negative unintended consequences), considering possible mechanisms for fine-tuning program operations to align the actual program with the planned approach, or revisiting program strategies to consider alternatives.(2)

Logic models are constructed to show temporal sequences, building left to right, and they typically diagram relationships with arrows. An example of a logic model is shown in Exhibit A. It was developed by the Urban Institute during the planning of the evaluation of the Children At Risk program (CAR). CAR is an intensive intervention program designed to prevent involvement in drugs and crime, and to foster healthy development among adolescents ages 13 to 15 who exhibit serious risk indicators and live in severely distressed inner-city neighborhoods.


The intervention consists of eight required program components:

Case Managers employed by the program make a service plan for all members of the household of participating youth and provide intensive follow-up on referrals to needed services, handling a caseload of 15;

Family Services include parenting skills training for all parents, and referral to other services as needed (intensive family counseling, stress management/coping skills training, identification and treatment of substance abuse, health care, job training and employment programs, housing, and income support services);

Education Services include tutoring or homework assistance for all youth, and referral to other services as needed (educational testing, special education classes);

After-School and Summer Activities for all CAR youth include recreational programs and life-skill/leadership development activities, combined with training or education;

Mentoring is provided by local organizations for youth in need of a caring relationship with an adult. The role of the mentor is to: (a) inform youth about alternative available choices (e.g., activities and goals); (b) familiarize them with strategies available for pursuing those choices; (c) provide training, opportunities for practice, and feedback in the development of skills for implementing particular strategies; and (d) provide relationships through which youth are affirmed, inspired, and encouraged to make healthy choices;

Incentives such as gifts and special events are used to build morale and attachment to the pro-social goals of the program (e.g., gift certificates, trips, and vouchers for pizza, sports shops, movies, and stipends for community service during summer programs);

Community Policing/Enhanced Enforcement is used in all target neighborhoods to create safer environments with less drug activity. Law enforcement activities include out-stationing police in schools and neighborhood locations to maintain order and enhance relationships with community groups;

Criminal/Juvenile Justice Intervention involves collaboration between case managers and juvenile court personnel to provide community service opportunities and enhanced supervision of youth in the justice system.


Antecedent variables include the levels and types of neighborhood, family, peer group, and personal risk factors for participants as well as their demographic characteristics. These are influences that are present before the program intervention.

Mediating variables include exposure to other social or educational services, perceptions of opportunities, and social norms. These are influences that operate at the same time as the program is operating. The program components are designed to achieve the intermediate outcomes-reductions in risk factors and enhancement of protective factors at the end of program participation. These intermediate outcomes, measured at the end of program participation, are hypothesized to be requisite steps toward the desired longer-term outcomes-prevention of drug use, drug selling, delinquency, school failure and dropout, and teen parenthood.

Exhibit A: Logic Model Used in Evaluation of the Children At Risk Program

Program outputs, not shown in this diagram, include indicators of performance such as the number of tutoring sessions provided, number of home visits by case managers, and number of times parents participated in program activities.

Assessing Readiness for Evaluation

Evaluability assessment is a systematic procedure for deciding whether program evaluation is justified, feasible, and likely to provide useful information. Questions to be considered in an evaluability assessment include: (3)


Is the program's logic model plausible given the resources available and guidance from the relevant literature? If program goals are unrealistic or the intervention strategies not well grounded in theory and/or prior evidence, then evaluation is not a good investment.

What kinds of data will be needed, from what number of subjects, and what data are likely to be already available? Evaluations should be designed to maximize the use of available data, as long as these are valid indicators of important concepts and are reliable. Available data may, for example, include government statistics, individual and summary agency records and statistics, and information collected by researchers for other studies. If there are crucial data needs not met with existing data, resources must be available to collect the requisite new data.

Are adequate resources and assets available-money, time, expertise, and community and government support? Are there any factors that limit or constrain access to these resources?

Can the evaluation be achieved in a time frame that will permit the findings to be useful in making program and policy decisions by federal, state, and local officials?

To what extent does evaluation information already exist somewhere on the same or a closely related intervention? The answer to this question can have important implications for action. Any successful previous attempts may yield promising models for replication. Lessons learned from previous unsuccessful attempts may inform the current effort. If sufficient evidence already exists from previous efforts, the value of a new evaluation may be marginal.

To what extent are the findings from an evaluation likely to be generalizable to other communities, and therefore useful in assessing whether the program should be expanded to other settings or areas? Are there unique characteristics of the projects to be evaluated that might not apply to most other projects? Program characteristics that are not generalizable reduce the value of any findings.

Selecting an Evaluation Design

Selection of the evaluation design follows the systematic consideration of these questions. As noted, there are four major types of evaluation: impact, performance monitoring, process, and cost. We discuss each in turn.

Impact Evaluation Designs

Three designs are possible for impact evaluations: experimental, quasi-experimental, and non-experimental. They all share the strategy of comparing program outcomes with some measure of what would have happened without the program. Experimental designs are the most powerful and produce the strongest evidence. They are not always possible, however, in which case one of the two other alternatives must be chosen. (A later section discusses how to make the choice.)


EXPERIMENTAL DESIGNS

Key Elements. Experimental designs are considered the "gold standard" in impact evaluation. Experiments require that individuals or groups, such as classrooms or schools, be assigned at random (by the flip of a coin or equivalent randomizing procedure) to one or more groups prior to the start of services. The "treatment" group or groups will be designated to receive particular services designed to achieve clearly specified outcomes. If multiple treatment groups are designated, the outcomes for the treatment groups may be compared to one another to estimate the relative impact of the different services or the impact relative to a control group. A "control" group receives no services. The treatment group outcomes are compared to control group outcomes to estimate impact. Because chance alone determines who receives the program services, the groups can be assumed to be similar on all characteristics that might affect the outcome measures except the program. Any differences between treatment and control groups, therefore, can be attributed with confidence to the impacts of the program.
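To make the procedure concrete, the sketch below (in Python, using only standard library functions) shows one way an evaluator-controlled random assignment and a simple comparison of group means might be carried out. The roster, seed value, and outcome names are hypothetical; an actual experiment would also involve sample size (power) calculations and tests of statistical significance.

    import random
    from statistics import mean

    # Hypothetical roster of eligible youth, identified before services begin.
    eligible_ids = [f"youth_{i:03d}" for i in range(1, 201)]

    rng = random.Random(20240501)   # fixed seed so the assignment is reproducible and auditable
    rng.shuffle(eligible_ids)

    half = len(eligible_ids) // 2
    treatment_ids = set(eligible_ids[:half])    # designated to receive program services
    control_ids = set(eligible_ids[half:])      # receive no program services

    # After follow-up, outcomes (e.g., 1 = still enrolled in school, 0 = not)
    # are collected for both groups; `outcomes` maps each youth ID to the measure.
    def estimate_impact(outcomes):
        treat = [y for i, y in outcomes.items() if i in treatment_ids]
        ctrl = [y for i, y in outcomes.items() if i in control_ids]
        # With random assignment, the difference in group means estimates program impact.
        return mean(treat) - mean(ctrl)

Keeping the assignment list under the evaluator's control, as here, also guards against the randomization failures discussed below.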

Design Variations. One design variation is based on a random selection of time periods during which services are provided. For example, new services may be offered on randomly chosen weeks or days. A version of this approach is to use "week on/week off" assignment procedures. Although not truly random, this approach closely approximates random assignment if client characteristics do not vary systematically from week to week. It has the major advantage that program staff often find it easier to implement than making decisions on program entry by the flip of a coin on a case-by-case basis. A second design variation is a staggered start approach, in which some members of the target group are randomly selected to receive services with the understanding that the remainder will receive services at a later time (in the case of a school or classroom, the next semester or month). One disadvantage of the staggered start design is that the observations of outcomes are limited to the period between the time the first group completes the program and the second group begins. As a result, it is generally restricted to assessing gains made during participation in relatively short-term programs.

Limitations/Considerations. Although experiments are the preferred design for an impact evaluation on scientific grounds, random assignment evaluations are not always the ideal choice in real-life settings. Some interventions are inherently impossible to study through randomized experiments. Youth curfews, for example, cannot be enforced against a randomly selected subset of children in a community. And "week on/week off" enforcement is likely to breed contempt for both the law and enforcement.

A second consideration is whether random assignment is ethical and acceptable to the community. Public opinion may resist treating similar children differently on the basis of a coin flip or may view random assignment as exploiting vulnerable populations and powerless people. Carefully designed procedures for randomization may be able to overcome such resistance. One strategy is random selection of those to receive services from a list of those who meet eligibility requirements when resources are not available to serve everyone who is eligible. This form of drawing lots is close enough to "first come, first served" to be accepted as fair in many situations. Providing services for some clients at a later time (the next month or semester, as described above) may satisfy community concerns about fairness and be consistent with available staff and resources. Sometimes random assignment can involve relaxing a requirement instead of adding one, which makes randomization less controversial. Great care needs to be taken to ensure that the control group is not denied essential services they would otherwise have, that the benefits to participants and the community are carefully explained, and that program staff and participants understand and support the research. Many funders require a formal review of the research design by a panel trained in guidelines developed to protect research participants. Even when such review is not required, explicit consideration of this issue is essential.

A third important issue is whether the results that are likely to be obtained justify the investment. Experiments typically require high levels of resources--money, time, expertise, and support from program staff, government agencies, funders and the community. Evaluation planners have to ask themselves whether the answers to the list of evaluation questions-and the decisions on program continuation, expansion, or modification that will be made on the basis of the findings--could be based on less costly, less definitive, but still acceptable evaluation strategies.

Practical Issues. Experimental designs run the most risk of being contaminated because of deliberate or accidental mistakes made in the field. To minimize this danger, there must be close collaboration between the evaluation team and the program staff in identifying objectives, setting schedules, dividing responsibilities for record-keeping and data collection, making decisions regarding client contact, and sharing information on progress and problems. Active support of the key program administrators, ongoing staff training and communication via meetings, conference calls, or e-mail are essential.

Failure to adhere to the plan for random assignment is a common problem. Staff are often intensely committed to their clients and will want to base program entry decisions on their perceptions of who needs, or will benefit from, the program. To prevent this pitfall, procedures should be set up so that the evaluator, not program staff, is in charge of the allocation to treatment or control group. Statistical adjustments in the analysis may be needed if there are operational failures to maintain the randomization process.(4) And even these may be inadequate to remove the biases thus introduced.

Another potential problem area is noncomparable information for treatment and control group members. Program staff can readily collect data and provide contact information for treatment group members because they have continuing contacts with clients, other agencies, and the community. Collecting comparable data and contact information on control group members can be difficult. If the evaluation loses track of more control group members than treatment group members, the evaluation data will not only be incomplete but will also provide distorted and therefore misleading information on what impacts the program has. The best way to avoid bias from this problem (called differential attrition) is to plan tracking procedures and data collection at the start of the evaluation, gathering information from the control group members on how they can be located, and developing agreements with other community agencies, preferably in writing, for assistance in data collection and sample member tracking. These agreements are helpful in maintaining sample continuity in the face of staff turnover at the agencies involved.
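A basic check for differential attrition can be built into the data collection plan from the start. The sketch below is a hypothetical Python illustration that compares follow-up completion rates for the treatment and control groups and flags a gap large enough to threaten the impact estimates; the records and the 10-percentage-point threshold are invented for the example.

    # Hypothetical follow-up status records: (group, located_at_follow_up)
    records = [
        ("treatment", True), ("treatment", True), ("treatment", False),
        ("control", True), ("control", False), ("control", False),
    ]

    def follow_up_rate(group):
        rows = [located for g, located in records if g == group]
        return sum(rows) / len(rows)

    treat_rate = follow_up_rate("treatment")
    control_rate = follow_up_rate("control")
    print(f"treatment follow-up rate: {treat_rate:.0%}")
    print(f"control follow-up rate:   {control_rate:.0%}")

    # A large gap signals differential attrition and possible bias in impact estimates.
    if abs(treat_rate - control_rate) > 0.10:
        print("Warning: follow-up rates differ by more than 10 percentage points.")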

If the program services and content change over time, it may be difficult to determine what level or type of services produced the outcomes. The best strategy is to identify key changes in the program and the timing of changes as part of a process evaluation and use this information to define "types of program" variations in the program experience of different participants for the impact analysis. Other potential problems may be solvable through the use of special statistical techniques. Such problems include insufficient or unequal follow-up periods for treatment versus control,(5) and the risk of events (e.g., failure in school, incarceration, injury, moving) that are more likely to remove some types of members from a sample than others before the end of the planned follow-up period.(6)

Example. The evaluation of Project Alert, an eight-week junior high school curriculum for teaching seventh grade students to avoid drug use, used an experimental design.(7) Thirty California and Oregon schools were randomly assigned to three groups: 1) schools in which students were instructed by adult health educators, 2) schools in which students were instructed by older teenagers, and 3) a no-treatment control group, although four of the non-treatment schools provided other drug prevention instructional programs. To increase the generalizability of the findings, the schools were drawn from eight urban, suburban, and rural communities, and nearly a third of the schools had minority populations of 50 percent or higher. To increase the pre-assignment similarities of the three experimental groups and strengthen the statistical power of the analysis (given the relatively small sample of schools), each experimental group was included in at least one school in each community, and the schools included in the experiment were matched to the extent possible to reduce differences among groups in such characteristics as test scores, language spoken at home, drug use among 8th graders, and ethnic and income composition. These procedures produced substantial pre-experimental similarities in factors related to drug use among the experimental groups. Since schools but not students were randomly assigned, statistical adjustments were used to correct for the clustering of students within schools. Students completed questionnaires about their drug use seven times between grades 7 and 12; those who transferred to other schools or districts completed mail and telephone interviews to minimize sample attrition. Outcome measures included cognitive risk factors associated with drug use: beliefs about consequences of use, norms regarding drug use, peer resistance, self-efficacy, and expected future drug use.

Experimental evaluations are costly. The Children At Risk evaluation, for example, cost $1.5 million. But the rigorous design permitted strong conclusions about the long-term effectiveness of drug prevention education during early adolescence and demonstrated that the results are not restricted to middle-class communities but extend to schools with high proportions of lower-income and minority students.

QUASI-EXPERIMENTAL DESIGNS
Key Elements. Like experiments, quasi-experimental evaluations compare outcomes from program participants to outcomes for comparison groups that do not receive program services. The critical difference is that the decision on who receives the program is not random. Comparison groups are made up of members of the target population as similar as possible to program participants on factors that could affect the selected outcomes to be observed. Multivariate statistical techniques are then used to control for remaining differences between the groups.
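One widely used multivariate technique is regression adjustment, in which the outcome is regressed on a program participation indicator plus the measured characteristics on which the groups may differ. The sketch below, using Python and the numpy library, is a minimal illustration with hypothetical data; applied analyses would also report standard errors and examine model assumptions.

    import numpy as np

    # Hypothetical individual-level data.
    participated = np.array([1, 1, 1, 0, 0, 0, 1, 0])          # 1 = program participant
    age = np.array([14, 15, 13, 14, 15, 13, 15, 14])            # measured covariates
    prior_attendance = np.array([0.80, 0.75, 0.90, 0.85, 0.70, 0.95, 0.60, 0.88])
    outcome = np.array([0.90, 0.85, 0.95, 0.80, 0.72, 0.93, 0.78, 0.86])  # attendance after program

    # Design matrix: intercept, participation indicator, and covariates.
    X = np.column_stack([np.ones(len(outcome)), participated, age, prior_attendance])
    coefficients, *_ = np.linalg.lstsq(X, outcome, rcond=None)

    # The coefficient on `participated` is the covariate-adjusted estimate of the
    # association between participation and the outcome (not proof of causation).
    print("adjusted participation effect:", round(float(coefficients[1]), 3))

The adjustment can only remove differences on characteristics that are actually measured, which is why the selection threat discussed below remains.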

Usually, evaluators use existing population groups for comparison-those who live in a similar area, or are enrolled in the same school in a different classroom, or attended the same school with the same teacher in the previous year. In some situations, staff (or schools or communities) are willing or trained to try the new "treatment" while others are not, but the same rules for service eligibility are used by all.

Design Variations. The primary variation is to construct a comparison group by matching individuals to individuals in the treatment group on a selected set of characteristics. This process for selecting a comparison group is methodologically less defensible.(8) The threats to validity are twofold: 1) matches based on similarities at a single point in time do not always result in groups of individuals who are comparable over time, so the groups may become increasingly different over time independent of the program; and 2) differences in variables not used in the matching may have a substantial effect independently of the program being evaluated.

Quasi-experimental designs vary in the number and timing of the collection of data on program outcome measures. The selection of the number and timing of measurements is based on an assessment of the potential threats posed by competing hypotheses that cannot be ruled out by the comparison methodology. In many situations, the strongest designs are those that collect pre-program measures of outcomes and risk factors and use these in the analysis to focus on within-individual changes that occur during the program period. These variables are also used to identify groups of participants who benefit most from the services. One design variation involves additional measurement points (in addition to simple before and after) to measure trends more precisely. Another variation is useful when pre-program data collection (such as administering a test on knowledge or attitudes) might "teach" youth about the questions to be asked after the program to measure change, and thus distort the measurement of program impact. This variation involves limiting data collection to the end of the program period for some groups, allowing their post-program answers to be compared with the post-program answers of those who also participated in the pre-program testing.
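The within-individual change analysis described above can be illustrated with a short sketch. In the hypothetical Python example below, each person's gain from pre-program to post-program measurement is computed, and the mean gain for participants is compared with the mean gain for a comparison group, netting out change that both groups experienced; all scores and group labels are invented.

    from statistics import mean

    # Hypothetical records: (group, pre_program_score, post_program_score)
    records = [
        ("program", 60, 72), ("program", 55, 70), ("program", 65, 75),
        ("comparison", 58, 62), ("comparison", 62, 66), ("comparison", 57, 60),
    ]

    def mean_gain(group):
        gains = [post - pre for g, pre, post in records if g == group]
        return mean(gains)

    program_gain = mean_gain("program")
    comparison_gain = mean_gain("comparison")

    # The difference in mean gains nets out changes experienced by both groups.
    print("program mean gain:   ", program_gain)
    print("comparison mean gain:", comparison_gain)
    print("difference in gains: ", program_gain - comparison_gain)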

Considerations/Limitations. Use of non-equivalent control group designs requires careful attention to procedures that rule out competing hypotheses regarding what caused any observed differences on the outcomes of interest. In evaluations of programs for vulnerable children and youth, three threats to validity stand out.(9)

The first is the threat of "maturation"--the possibility that age-related processes will contribute to outcomes independently of the program intervention. Among youth, certain outcomes, positive and negative, are strongly tied to age--outcomes such as drug use, delinquency, and early parenthood. It is therefore necessary to be sure that the comparison group is made up of youth at the same developmental stage.

A second threat is that of "history"--the risk that unrelated events may affect outcomes. For example, the rapid spread of crack use among women of childbearing age in the United States in the late 1980s greatly increased rates of drug-exposed infants. Thus, a comparison group for an evaluation of a prenatal health care program would need to be drawn from the same years and communities to "control" for the spread of crack. Otherwise, the upward trend in negative outcomes due to crack could obscure the prevention benefits of the program. Similarly, designs need to consider controls for geographic variation in events external to the program. For example, gang crackdowns in some neighborhoods and not others could influence assessments of the impact of a school-based delinquency or drug prevention program. If the crackdown occurred in the "treatment" neighborhood, the program effects might be over-estimated; if it occurred in the comparison neighborhood, program effects might be under-estimated.

A third threat to validity is the process of "selection"-the factors that determine who receives services. Some of these factors are readily identified and can be used as control variables in statistical models, such as living in a specific school district or meeting program eligibility criteria. However, it is unlikely that all factors will be correctly identified and adequately measured. For example, program participants may receive services because they are more motivated, skillful, or socially well connected than nonparticipants. Such differences are not easy to measure during a program evaluation.

Practical Problems. Building defenses or "controls" for threats to validity into evaluation designs through the selection of comparison groups and the timing of outcome observations is a challenge. Controls for maturation, history, and selection may involve, respectively, selecting a sample that includes multiple age cohorts, collecting data in similar or nearby localities that lack the program,(10) or applying a statistical model that controls for foreseeable biases in selecting program participants.(11) Even when the comparison group is carefully selected, the researcher cannot be sure that all relevant group differences have been identified and measured accurately. Statistical methods can adjust for such problems and increase the precision with which program effects can be estimated,(12) but they do not fully compensate for the non-random design. Findings need to be interpreted extremely cautiously and untested alternative hypotheses carefully considered.

As in experimental evaluation, plans for quasi-experimental evaluations need to pay close attention to the problem of collecting comparable information on control group members and developing procedures for tracking them. However, the need for close collaboration with program staff is reduced, since the staff are generally neither involved in selecting participants nor in contact with comparison group members.

Example. The evaluation of the Teen Age Parenting Program (TAPP) for adolescents divided teen mothers into three groups designed to be similar in age and other characteristics.(13) Each group was evenly divided among black, Hispanic, and white participants. One group attended an alternative school with child development and parenting classes and a nursery school featuring a parenting-child development curriculum. Another group attended an alternative school without a nursery school. The remaining group received no special services for teenage parents. Services began during pregnancy. Assessments of educational progress, fertility, knowledge, and child development two to four years later were based on interviews and school records. Mothers in the alternative school with the nursery program had completed more schooling and were more likely to still be enrolled in school than the other mothers. Mothers in both alternative schools had more knowledge about parenting and reproduction and more positive attitudes about parenting than those without special services. But there were no significant differences among the groups on child development outcome measures. How to interpret this seeming inconsistency is complicated, because the evaluation design did not have pre-program measures of individual differences and assignment was not random. The education and knowledge differences across the three groups may have been there from the beginning, rather than being attributable to the special services.

NON-EXPERIMENTAL IMPACT EVALUATIONS
Key Elements. Non-experimental impact evaluations examine changes in levels of risk or outcomes among program participants, or groups including program participants, but do not include comparison groups of other individuals or groups not exposed to the program.

Design Variations. The four primary types of non-experimental designs are: 1) before and after comparisons of program participants; 2) time series designs based on repeated measures of outcomes before and after the program for groups that include program participants; 3) panel studies based on repeated measurement of outcomes on the same group of participants; and 4) post-program comparisons among groups of participants.

The first two designs are based on analysis of aggregate data. In before and after comparisons, outcomes for groups of participants (program groups that enter the program at a specific time and progress through it over the same time frame) are measured before and after an intervention, and an assessment of impact is inferred from the differences. This simple design is often used to assess whether knowledge, attitudes, or behavior of the group changed after exposure to a classroom curriculum or job training program. Time series designs are an extension of the before and after design that uses multiple measures of the outcome variables before an intervention begins and continues to take multiple measures after the intervention is in place. If a change in the trend (direction or level) in the outcome occurs at, or shortly after, the time of the intervention, the significance of the observed change is tested statistically. Time series measures may be based on larger groups or units that include but are not restricted to program participants. For example, crime rates for neighborhoods in which most or all youth participate in a delinquency prevention program might be used to assess reductions in illegal activity. Evaluation of a series of dropout prevention activities offered across the school year could examine the percentages of entering classes that graduate over a period of years. Time series designs should be considered when it is difficult to identify who receives program services or when the evaluation budget does not support collection of detailed data from program participants. Although new statistical techniques have strengthened the statistical power of these designs,(14) it is still difficult to rule out the potential impact of non-program events using this approach.
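The logic of a time series analysis can be sketched as a regression that includes an underlying time trend and an indicator for the post-intervention period; the estimated shift in level at the time of the intervention is the quantity of interest. The Python example below uses the numpy library and invented monthly counts; a real analysis would also test the shift for statistical significance and account for autocorrelation.

    import numpy as np

    # Hypothetical monthly counts of a neighborhood outcome (e.g., reported incidents),
    # 12 months before and 12 months after the program begins.
    counts = np.array([40, 42, 39, 41, 43, 40, 44, 42, 41, 43, 42, 44,
                       38, 36, 35, 37, 34, 33, 35, 32, 34, 31, 33, 30], dtype=float)
    months = np.arange(len(counts), dtype=float)
    post = (months >= 12).astype(float)   # 1 in months after the intervention begins

    # Regression: intercept, underlying time trend, and post-intervention level shift.
    X = np.column_stack([np.ones_like(months), months, post])
    coefficients, *_ = np.linalg.lstsq(X, counts, rcond=None)

    # A negative level shift is consistent with (but does not prove) a program effect,
    # since non-program events occurring at the same time cannot be ruled out.
    print("estimated level shift after intervention:", round(float(coefficients[2]), 1))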

The next two designs examine data at the individual level. Cross-sectional comparisons are based on surveys of groups of participants conducted after program completion. This design can be used to estimate correlations between outcomes and differences in the duration, type, and intensity of services received, yielding conclusions about plausible links between outcomes and services but no definitive conclusions about what caused what. Panel designs use repeated measures of the outcome variables for each individual. In this design, outcomes are measured for the same group of program participants, often starting at the time they enter the program and continuing at intervals over time. For example, the evaluation of Health Planning and Promotion: Life Planning Education used pre-post data from participants to measure gains in understanding the best combinations of contraceptive methods and the consequences of early childbearing.(15) This design allows the characteristics of individual participants to be used in the analysis to identify different patterns of change associated with individual characteristics of participants and control for other events to which they were exposed.

Considerations/Limitations. Several limitations to non-experimental designs should be noted. First, the cross-sectional and panel designs provide only a segment of "dose-response curve," that is, only estimates of the differences in impact related to differences in the services received. These designs cannot estimate the full impact of the program compared to no service at all, unless estimates can be based on other information on the risks of the target population. Second, the designs that track participants over time (before and after, panel, and time series) cannot control for the effects of developmental changes that would have occurred without services, or for the effects of other events outside the program's influence. Third, the extent to which the results can be assumed to apply to other groups or other settings is limited, because this design provides no information for assessing the extent to which participants were selected into the program on the basis of factors which themselves influence outcomes.

Practical Issues. Non-experimental designs have considerable practical advantages because they are relatively easy and inexpensive to conduct. Individual data for cross-sectional or panel analysis are often collected routinely by the program at the end (and sometimes beginning) of program participation. When relying on program records, the evaluator needs to review the available data against the logic model to be sure that adequate information on key variables is already included, or to begin collecting additional data items if needed.

When individual program records are not available, aggregate statistics may be obtained from the program or from other community agencies with information on the outcomes among groups of participants. For example, crime rates, average promotion rates, and rates of births to teen mothers can be collected from existing records. The primary problem encountered in using such statistics for assessing impacts is that they may not be available for the specific population or geographic area targeted by the program. Often these routinely collected statistics are based on the general population or geographic areas served by the agency (e.g., the police precinct or the clinic catchment area). The rates of negative outcomes for the entire set of cases included may well differ from rates for the targeted group of vulnerable children and youth; this risk is greater for larger rather than smaller statistical areas.

A more expensive form of data collection for non-experimental evaluations is a survey of participants some time after the end of the program. These surveys can provide much needed information on longer-term outcomes such as rates of employment or earnings or high school graduation. As in any survey research, the quality of the results is determined by response rate rather than overall sample size, and by careful attention to the validity and reliability of the questionnaire items.

Example. The Youth Training Scheme (YTS) in Great Britain provides, through local agents, two years of vocational and on-the-job training for out-of-school and unemployed youth ages 16 and 17. The local agents are businesses or community organizations that receive government funds to design a training program, recruit and supervise youth, and provide at least 13 weeks of on-the-job training per year. Non-experimental evaluation of YTS was based on a follow-up survey of 63,000 former participants.(16) In addition to monitoring client satisfaction and job related outcomes, the survey was used in non-experimental comparisons of differences in outcomes related to differences among participants: job market outcomes were compared for graduates versus program dropouts and across youth who entered the program with different levels of motivation and past school achievement. Results indicate that program graduates had better labor market outcomes than those who did not complete the program. Similarly, earning qualifications in the program (an interim outcome measure) was positively correlated with later labor market success (the longer term outcome). Non-experimental comparisons were also used to identify differences in outcomes related to characteristics of the participants or the training experience. The field of employment and type of local agent providing the training were significant predictors of labor market outcomes. Similarly, labor market outcomes were better for youth who began the program with higher levels of motivation and past school achievement. These findings are suggestive but not definitive. Because of the non-experimental design, participating youth might have been more likely to become employed than other youth even in the absence of the program.

CHOOSING AMONG THE IMPACT DESIGNS
Choice of an impact evaluation design begins by identifying the design that both offers the strongest capacity for isolating the independent causal effects of the program and is feasible given the structure of the program. The "decision tree" shown in Exhibit B illustrates a process for identifying which alternatives are feasible.

If the program will be provided to a limited number of youth who can be identified in advance and randomly selected for participation, then an experimental design should be considered. If the program will be provided to a limited number of youth, but the decision about who receives services is determined by organizational or geographic considerations (or other nonrandom selection rules), then quasi-experimental design variations should be considered.

The most difficult design challenges occur when the program is intended to serve all members of the target population. If the new program is implemented fully and rapidly, no youth will be available for a comparison group. Often, however, new full-coverage programs-for example, new health services-are intended for an entire population but not implemented in every community in the country, and certainly not at the same time. If some communities or groups are not included in the initial implementation, it may be possible to select as comparison sites communities that have not implemented the program and use a quasi-experimental design. This may not solve the problem of comparability sufficiently to allow such a design, however, if the communities where it was implemented have characteristics that are systematically different from those where it was not.

When non-experimental designs are necessary, the following can help guide the choice of design. If a program is implemented at different levels across sites but uniformly within sites, a cross-sectional design is suitable. If a target population is exposed to different levels of the program within a community, a panel study design is better-to follow a sample of individuals, and record both outcomes and the amount of the program or intervention each individual received and when it occurred. If defining who is served by the program is difficult or the program is uniformly applied in all communities, then a time-series design is appropriate. Before-and-after designs without control groups are often used, but are subject to a number of threats to validity, including maturation and secular changes (discussed above).

Performance Monitoring

Key Elements. Performance monitoring is used to provide information on: 1) key aspects of how a system or program is operating; 2) whether, and to what extent, pre-specified program objectives are being attained (e.g., numbers of youth served compared to target goals, reductions in school dropouts compared to target goals); and 3) identification of failures to produce program outputs, for use in managing or redesigning program operations. Performance indicators can also be developed to 4) monitor service quality by collecting data on the satisfaction of those served, and 5) report on program efficiency, effectiveness, and productivity by assessing the relationship between the resources used (program inputs) and the output and outcome indicators.

If conducted frequently enough and in a timely way, performance monitoring can provide managers with regular feedback that will allow them to identify problems, take timely action, and subsequently assess whether their actions have led to the improvements sought. Performance measures can also stimulate communication about program goals, progress, obstacles, and results among program staff and managers, the public, and other stakeholders. They focus attention on the specific outcomes desired and better ways to achieve them, and can promote credibility by highlighting the accomplishments and value of the program.

Performance monitoring involves identification and collection of specific data on program outputs, outcomes, and accomplishments. Although they may measure subjective factors such as client satisfaction, the data are numeric, consisting of frequency counts, statistical averages, ratios, or percentages. Output measures reflect internal activities: the amount of work done within the program or organization. Outcome measures (immediate and longer term) reflect progress towards program goals. Often the same measurements (e.g., number/percent of youth who stopped or reduced substance abuse) may be used for performance monitoring and impact evaluation. However, unlike impact evaluation, performance monitoring does not make any rigorous effort to determine whether these were caused by program efforts or by other external events.

Exhibit B: Process for Selecting Impact Evaluation Designs

Design Variations. When programs are operating in a number of communities, the sites are likely to vary in mission, structure, the nature and extent of project implementation, primary clients/targets, and timeliness. They may offer somewhat different sets of services, or have identified somewhat different goals. In such situations, it is advisable to construct a "core" set of performance measures to be used by all, and to supplement these with "local" performance indicators that reflect differences. For example, some youth programs will collect detailed data on youth school performance, including grades, attendance, and disciplinary actions, while others will simply have data on promotion to the next grade or whether the youth is still enrolled or has dropped out. A multi-school performance monitoring system might require data on promotion and enrollment for all schools, and specify more detailed or specialized indicators on attendance or disciplinary actions for one or a subset of schools to use in their own performance monitoring.
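As an illustration of how a core indicator set might be produced each quarter, the hypothetical Python sketch below computes a few indicators from invented program records and compares them with target goals; site-specific ("local") indicators could be added alongside the core set.

    # Hypothetical quarterly program records and target goals.
    records = {
        "youth_enrolled": 118,
        "youth_target": 150,
        "tutoring_sessions_delivered": 540,
        "dropouts_this_quarter": 6,
        "dropouts_same_quarter_last_year": 10,
    }

    def core_indicators(r):
        return {
            "percent of enrollment target reached":
                round(100 * r["youth_enrolled"] / r["youth_target"], 1),
            "tutoring sessions per enrolled youth":
                round(r["tutoring_sessions_delivered"] / r["youth_enrolled"], 1),
            "percent change in dropouts vs. last year":
                round(100 * (r["dropouts_this_quarter"] - r["dropouts_same_quarter_last_year"])
                      / r["dropouts_same_quarter_last_year"], 1),
        }

    for name, value in core_indicators(records).items():
        print(f"{name}: {value}")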

Considerations/Limitations. In selecting performance indicators, evaluators and service providers need to consider:

The relevance of potential measures to the mission/objective of the local program or national initiative. Do process indicators reflect program strategies/activities identified in mission statements? Do outcome indicators cover objectives identified in mission statements? Do indicators capture the priorities at the community level?

The comprehensiveness of the set of measures. Does the set of performance measures cover inputs, outputs, and service quality as well as outcomes and include relevant items of customer feedback?

The program's control over the factor being measured. Does the program have influence/control over the outputs or outcomes measured by the indicator? If the program has only limited influence over the outputs or outcomes being measured, the indicator may not fairly reflect program performance.

The validity of the measure. Do the proposed indicators reflect the range of outcomes the program hopes to affect? Are the data free from obvious reporting bias?

The reliability and accuracy of the measure. Can indicators be operationally defined in a straightforward manner so that supporting data can be collected consistently over time, across data gatherers, and across communities? Do existing data sources meet these criteria?

The feasibility of collecting the data. How much effort and money is required to generate each measure? Should a particularly costly measure be retained because it is perceived as critically important?

Practical Issues. The set of performance indicators should be simple, limited to a few key indicators of priority outcomes. Too many indicators burden the data collection and analysis and make it less likely that managers will understand and use reported information. At the same time, the set of indicators should be constructed to reflect the informational needs of stakeholders at all levels-community members, agency directors, and national funders.

Regular measurement, ideally quarterly, is important so that the system provides the information in time to make shifts in program operations and to capture changes over time. However, pressures for timely reporting should not be allowed to sacrifice data quality. For the performance monitoring to take place in a reliable and timely way, the evaluation should include adequate support and plans for training and technical assistance for data collection. Routine quality control procedures should be established to check on data entry accuracy and missing information. At the point of analysis, procedures for verifying trends should be in place, particularly if the results are unexpected.
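Routine quality control can be as simple as an automated scan for missing or out-of-range values before each reporting cycle. The sketch below is a hypothetical Python illustration; the field names and valid ranges are invented and would in practice come from the program's own data definitions.

    # Hypothetical client records awaiting entry into the quarterly report.
    records = [
        {"id": "A01", "age": 14, "sessions_attended": 9},
        {"id": "A02", "age": None, "sessions_attended": 11},   # missing age
        {"id": "A03", "age": 15, "sessions_attended": 140},    # implausible value
    ]

    VALID_RANGES = {"age": (10, 19), "sessions_attended": (0, 60)}

    def quality_check(rows):
        problems = []
        for row in rows:
            for field, (low, high) in VALID_RANGES.items():
                value = row.get(field)
                if value is None:
                    problems.append((row["id"], field, "missing"))
                elif not (low <= value <= high):
                    problems.append((row["id"], field, f"out of range: {value}"))
        return problems

    for record_id, field, issue in quality_check(records):
        print(record_id, field, issue)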

The costs of performance monitoring are modest relative to impact evaluations, but still vary widely depending on the data used. Most performance indicator data come from records maintained by service providers. The added expense involves regularly collecting and analyzing these records, as well as preparing and disseminating reports to those concerned. This is typically a part-time work assignment for a supervisor within the agency. The expense will be greater if client satisfaction surveys are used to measure outcomes. An outside survey organization may be required for a large-scale survey of past clients; alternatively, a self-administered exit questionnaire can be given to clients at the end of services. In either case, the assistance of professional researchers is needed in preparing data sets, analyses, and reports.

Example. The Asociacion Salud con Prevencion (ASCP) in Colombia, South America, a nongovernmental organization that provides primary prevention services promoting adolescent reproductive health, monitors outputs with data on the number of professionals trained, the number of youth given educational services, the number of workshops held, the number of condoms distributed, and the number of medical and counseling sessions provided. The results demonstrate that the program is providing promised services, but they do not give an indication of the impact in terms of either immediate outcomes such as use of birth control or longer-term outcomes (which include reduced risk of out-of-wedlock births or early childbearing).

Process Analysis

Key Elements. The key element in process analysis is a systematic, focused plan for collecting data to: (1) determine whether the program model is being implemented as specified and, if not, how operations differ from those initially planned; (2) identify unintended consequences and unanticipated outcomes; and (3) understand the program from the perspectives of staff, participants, and the community.

Design Variations. The systematic procedures used to collect data for process evaluation often include case studies, focus groups, and ethnography.

Case studies involve the detailed analysis of selected program sites or clients to determine how the program is operating, what barriers to program implementation have been encountered, what strategies are the most successful, and what resources and skills are necessary. The answers to these questions are useful in providing guidance to policymakers and program planners interested in identifying key program elements and in generating hypotheses about program impact that can be tested in impact analyses. Case studies are sometimes used to test competing hypotheses about differences in the impact of services. This strategy is used to assess which approach is most successful in attaining goals shared by all when competing models have emerged in different locations. This requires purposely selecting sites to represent variations in elements or types of programs, careful analysis of potential causal models, and the collection of qualitative data to elaborate the causal links at each site.

Clients or sites chosen for case studies should represent wide variation in settings, program models, and clients. Identification of sample members within sites, interview topics, and key data elements begins with the logic model as a guide. In a case study, qualitative data, collected using semi-structured interviews and observations of program operations, are often supplemented and verified by quantitative data on program operations and performance collected from records and reports.

Case studies may use several different approaches for collecting qualitative data for program evaluation. The most frequently used are semi-structured interviews, focus groups, and researcher observations while on-site. Semi-structured interviews allow for the discovery of unanticipated factors associated with program implementation and outcomes. Protocols for semi-structured interviews contain specific questions about particular issues or program practices. The "semi" aspect of these discussion guides refers to the fact that a respondent may give as long, detailed, and complex a response as he or she desires, whatever conveys the full reality of the program's experience with the issue at hand. If some issues have typical categories associated with them, the protocols will usually contain probes to make sure the researcher learns about each category of interest.

In case studies, observations at program sites provide an important method of validating information from interviews. In this case, the observations will often be guided by structured or semi-structured protocols designed to ensure that key items reported in interviews are verified and that consistent procedures for rating program performance are used across time and across sites.

Focus groups seek to understand attitudes through a series of group discussions guided by one researcher acting as a facilitator, with another researcher present to take detailed notes. Five or six general questions are selected to guide open-ended discussions lasting about an hour and a half. The goals of the discussions may vary from achieving group consensus to emphasizing points of divergence among participants. Discussions are tape-recorded, but the primary record is the detailed notes taken by the researcher who acts as recorder. Less detailed notes may also be taken publicly, on a flip-chart for all to see, to try to achieve consensus or give group members the chance to add anything they think is important. Soon after a particular focus group, the recorder and facilitator summarize in writing the main points that emerged in response to each of the general questions. When all focus groups are completed, the researchers develop a combined summary, noting group differences and suggesting hypotheses about those differences.

Ethnography relies almost exclusively on observation and unstructured interviews. It does not begin with the logic model; its intent is to understand the program from the perspective of staff, participants, and others in the community. Ethnographers observe program operations as unobtrusively as possible, sometimes in the role of participant observer, and keep detailed field notes that are transcribed and coded to identify emerging themes and trends. The critical research goal is to provide data on the subjective experience of those in the program situation and to use this information to understand if the program goals are being achieved and, if so, how.

Ethnography uses procedures that are deliberately flexible. As a result, ethnography is helpful in gathering information on unintended consequences and unanticipated outcomes. These unexpected observations may lead to an entirely new concept of program delivery. In a recent project examining service integration programs for at-risk youth, observations helped clarify that service integration needed to go beyond formal links and on-paper agreements, and provided insights into how informal processes bonded services together in their efforts to make a difference for high-risk youth in the community.(17) Observations from ethnographic studies are perhaps the hardest type of qualitative information to analyze, since they generate volumes of information, much of which may not be directly related to evaluation goals and may not be comparable across sites.

Practical Issues. Collecting qualitative data requires skilled researchers who are experienced with the techniques being used. To analyze these data, careful notes must be taken to ensure that responses are correctly recorded and to aid in interpreting them. In methods based on interviews, interviewers must be trained to understand the intent of each question, the possible variety of answers that respondents might give, and ways to probe to ensure that full information about the issues under investigation is obtained.

Analysis of qualitative data requires an in-depth understanding of programs, respondents and responses, and especially the context in which they are evaluated. Ultimately, the analyst makes judgments regarding the relative importance or significance of various responses. This requires an unbiased assessment of whether responses support or refute hypotheses about the way the program works and the effects it has.

One way to handle qualitative data is to treat one's interview and observational notes as text, and to conduct a textual analysis using specialized computer software that can search for the presence of specific themes or content. Qualitative software is available to facilitate the location and retrieval of information from massive textual files. This kind of software is expensive to use because huge amounts of text must be entered into a computer. Further, either the exact words one wants to search for must appear in the text, or the text must be marked for the presence of each theme or topic that the researcher wants to retrieve. Often researchers can achieve equal or better results with carefully constructed interview or data collection guides or structured focus groups, and with systematic recording of responses or coding of data encountered in the field.
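As an illustration of the kind of theme search such software performs, the following minimal sketch (all file names and theme phrases are hypothetical, not drawn from any study cited here) counts how often analyst-defined phrases appear in plain-text interview notes:

    # Illustrative sketch only: tally analyst-defined theme phrases in
    # plain-text interview notes (file names and themes are hypothetical).
    import glob
    from collections import Counter

    THEMES = {
        "referral_barriers": ["waiting list", "turned away", "no openings"],
        "staff_turnover": ["new caseworker", "left the agency", "position vacant"],
    }

    def tally_themes(note_files):
        counts = Counter()
        for path in note_files:
            with open(path, encoding="utf-8") as f:
                text = f.read().lower()
            for theme, phrases in THEMES.items():
                counts[theme] += sum(text.count(p) for p in phrases)
        return counts

    if __name__ == "__main__":
        print(tally_themes(glob.glob("interview_notes/*.txt")))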

Example. Case studies of two pilot projects were used for the evaluation of mentoring in the juvenile justice system conducted by Public/Private Ventures. The program was designed to match 100 mentors to at-risk youth. Mentors were trained to meet with youth one-on-one before and after the youth's release from juvenile detention facilities, with the goal of establishing an attachment to an adult role model. Data were collected from mentor logs, program records, court records, structured interviews with mentors and youth before and after program participation, staff interviews, focus groups with mentors, youth and service agency staff, and in-depth interviews with mentor-youth pairs. The qualitative analysis examined the characteristics of successful matches, issues in program implementation, the style and content of mentoring interactions, and program staffing. Although it does not offer evidence on outcomes, the evaluation provides extremely useful information on the process of implementing a mentoring program and guidance for program development and replication.

Cost Studies

Key Elements. Cost studies are used to assess investments in programs by collecting information on: 1) direct program expenditures; 2) the costs of staff and resources provided by other agencies or diverted from other uses; 3) costs for purchased services; and 4) the value of donated time and materials. Costs for the first two items usually include expenditures for staff salaries; fringe benefits; special training costs (if any); travel; facilities; and supplies and equipment that have to be purchased. The value of donated resources, which can be substantial, generally has to be estimated and requires careful documentation of the donation. Cost analyses indicate that donations are a major cost item in many youth programs. For example, the Cities in Schools evaluation(18) indicated that donations represent between 74 percent and 90 percent of total direct program costs, and that the wide variation among cities in the types of donations received made the inclusion of these costs essential to an understanding of the resources required to sustain program operation.

The typical approach to cost studies is to calculate total program costs and then an average cost per client, obtained by dividing the total by either the number of clients served or the number of clients who meet some standardized definition of success. This type of cost calculation can be linked to the results of an experimental or quasi-experimental impact evaluation to estimate costs per successful client. It can also be used with performance indicators to assess the cost or cost-efficiency of achieving program goals.
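As a simple illustration of this arithmetic, the short sketch below uses entirely hypothetical figures to compute cost per client served and cost per successful client:

    # Hypothetical figures for illustration only.
    total_program_cost = 500_000.00   # expenditures plus the estimated value of donations
    clients_served = 400
    clients_successful = 250          # meet a standardized definition of success

    cost_per_client = total_program_cost / clients_served       # $1,250
    cost_per_success = total_program_cost / clients_successful  # $2,000

    print(f"Cost per client served:     ${cost_per_client:,.0f}")
    print(f"Cost per successful client: ${cost_per_success:,.0f}")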

A second approach to cost estimation calculates the cost per unit of service, for example, the cost per hour of classroom instruction or the cost per hour of counseling. This type of cost calculation is then used in impact evaluations (including non-experimental evaluations) to look at the costs of different outcomes. This type of cost analysis is difficult in multi-faceted, comprehensive programs in which the level and type of service are highly variable and may involve a number of service providers. It is also problematic in programs in which exposure to services is hard to define. Where possible, it is preferable to distinguish between fixed costs (e.g., rent or the director's salary) and variable costs (e.g., the costs of special events or the hourly costs of the recreation director). The variable costs can then be used to estimate the marginal cost of adding clients to the number receiving a specific unit of service.
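A sketch of the fixed/variable distinction, again with hypothetical figures; only the variable portion enters the marginal cost of serving an additional client:

    # Hypothetical figures for illustration only.
    fixed_costs = 120_000.00     # e.g., rent and the director's salary
    variable_costs = 80_000.00   # e.g., hourly staff and special events
    counseling_hours = 4_000
    clients = 200

    avg_cost_per_hour = (fixed_costs + variable_costs) / counseling_hours  # $50.00
    variable_cost_per_hour = variable_costs / counseling_hours             # $20.00
    avg_hours_per_client = counseling_hours / clients                      # 20 hours

    # Marginal cost of one additional client, assuming the new client
    # receives the average number of counseling hours.
    marginal_cost_per_client = variable_cost_per_hour * avg_hours_per_client  # $400.00

    print(f"Average cost per counseling hour:  ${avg_cost_per_hour:.2f}")
    print(f"Marginal cost of one added client: ${marginal_cost_per_client:.2f}")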

Design Variations. Cost studies can be undertaken to describe program costs and link them to the level of outcomes achieved. In this application, the costs are compared to the level and type of outcomes documented through performance monitoring. Decisions on whether the outcomes justify the costs are based on opinions about the value of the outcomes (not monetized) and the likelihood that the outcomes are attributable to the program.

Cost-effectiveness analysis is used to compare the costs of different approaches to providing some standard level of service or desired level of outcome. This approach is most useful when multiple programs are using different models to provide a service. The requirements are that the characteristics of target populations served, the program goals, and the output or outcome measures be identical. For example, cost-effectiveness studies could compare the relative effectiveness of residential and nonresidential treatment for drug-abusing youth, provided that the youth served were similar in age and drug use problems, and that the same measures of treatment success were used.
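A minimal sketch of such a comparison, assuming two hypothetical programs that serve similar youth and use the same definition of treatment success:

    # Hypothetical figures for illustration only.
    programs = {
        "residential":    {"total_cost": 900_000.00, "successes": 150},
        "nonresidential": {"total_cost": 400_000.00, "successes": 100},
    }

    for name, p in programs.items():
        cost_per_success = p["total_cost"] / p["successes"]
        print(f"{name}: ${cost_per_success:,.0f} per treatment success")
    # residential: $6,000 per success; nonresidential: $4,000 per success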

Cost-benefit studies provide estimates of the dollar benefits returned for each dollar spent on the program, the key question from a policy perspective, but one that is not easily answered. This type of evaluation has rigorous requirements for: 1) an estimate of program costs, either per client or per unit of service; 2) comparative data on program impact, that is, an estimate of outcomes with and without the program; and 3) estimates of the dollar value of the benefits. The first item should be obtainable from program financial records, supplemented as needed by estimates of the cost of donated or reallocated resources. The second can be obtained from an experimental or quasi-experimental evaluation of program impact or another strategy for estimating the difference between what happened and what would have happened without the program.

The primary barrier to conducting cost-benefit analysis of service programs designed to change behavior stems from the third item: placing dollar values on benefits. Many benefits are of intrinsic value (e.g., reductions in family dysfunction and conflict) but quantifying that value is difficult.

Monetization of benefits to individuals requires assumptions about three matters, all of which are frequently controversial. First, the dollar value of the benefit may depend on personal values, for example, what residents are willing to pay for a crime-free neighborhood. Second, a dollar of benefit today is worth more than a dollar benefit realized next year. Thus, the benefits need to be time discounted, but by how much is a difficult question. Third, the beneficiaries need to be identified. Societal values become important when the beneficiaries differ in standing and perceived merit. For example, a high school equivalency degree for a violent youthful offender may result in the same gains in lifetime earnings for the offender as a violence victim would realize from physical therapy for the injury. Are they to be treated the same? To circumvent such difficult questions, the analyst may conduct a sensitivity analysis to reach conclusions based on explicit assumptions of value. For example, the neighborhood crime prevention program may be deemed cost-effective if "residents are willing to pay at least $100 per month for 10 percent lower rates of burglary" or "if the discount rate is less than 6 percent" or "if the offender's earnings are worth 50 percent of the victim's earnings."
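The discounting and sensitivity-analysis steps can be made concrete with a short sketch; the benefit stream, program cost, and discount rates below are hypothetical and serve only to show how the conclusion can flip as the contested assumption varies:

    # Hypothetical benefit stream: $10,000 of benefits per year for five years.
    benefits_by_year = [10_000.00] * 5
    program_cost = 40_000.00

    def present_value(stream, rate):
        # Discount a benefit received t years from now by (1 + rate) ** t.
        return sum(b / (1 + rate) ** t for t, b in enumerate(stream, start=1))

    # Sensitivity analysis: vary the discount rate rather than defend a single value.
    for rate in (0.03, 0.06, 0.10):
        pv = present_value(benefits_by_year, rate)
        verdict = "benefits exceed costs" if pv > program_cost else "costs exceed benefits"
        print(f"discount rate {rate:.0%}: present value ${pv:,.0f} -> {verdict}")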

Beyond benefits to individuals, the total value of benefits includes the social costs averted. These are the savings to the public that result from avoiding negative outcomes. These values must be based on studies that estimate the social costs of negative outcomes such as the costs of crime or drug abuse.(19) These estimates are difficult to derive and are often based on tenuous assumptions. To compensate for problems in the reliability of estimates, cost-benefit calculations normally use a range of benefits to place an upper and a lower bound on the probable returns to investments in the program. A more significant problem is that monetary values based on public costs for the negative outcomes among the general population may be poor estimates of the value of benefits among the program's target population. For example, national estimates of the costs of drug abuse may not apply to reductions in amphetamine abuse among low-income adolescents in a single city. This problem needs to be acknowledged and value estimates revised to the extent possible to reflect savings for the program's participants. Other public benefits reflecting gains, not costs averted, are widely acknowledged but rarely find their way into cost-benefit studies because there is no public consensus on their importance. Examples include improvements in the quality of life or the environment.

Considerations and Limitations. Documentation of gains from prevention programs is exceptionally difficult and requires estimating negative outcomes that did not occur. As described above, the most robust estimates of program impacts of this kind are based on experimental or quasi-experimental evaluations, which are difficult and expensive to conduct. When the program has total population coverage, it is possible to attribute differences between the observed trend and the predicted trend in an outcome indicator over time to program impacts and to estimate the monetary value of the benefits. This strategy was used to estimate the value of drug prevention efforts in the United States. National survey estimates of drug use in 1979 were used to estimate expected drug prevalence during the 1980s and early 1990s; the differences between these estimates and drug use prevalence rates based on national surveys during these years were attributed to federal investments in drug prevention programs.(20)
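A sketch of this trend-difference logic, with entirely hypothetical prevalence rates, population, and unit social costs: the predicted rate is projected from the pre-program baseline, and the shortfall in observed prevalence is attributed to the program and then valued.

    # Hypothetical figures for illustration only.
    population = 1_000_000
    predicted_prevalence = 0.140     # projected from the pre-program baseline trend
    observed_prevalence = 0.115      # measured after the program, e.g., by national surveys
    social_cost_per_case = 2_500.00  # assumed social cost of one case of the problem behavior

    cases_averted = (predicted_prevalence - observed_prevalence) * population  # 25,000
    estimated_benefit = cases_averted * social_cost_per_case                   # $62,500,000

    print(f"Cases averted:     {cases_averted:,.0f}")
    print(f"Estimated benefit: ${estimated_benefit:,.0f}")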

Practical Issues. Developing a conceptual framework that reflects all the issues in cost-benefit valuation, and then devoting the resources necessary for estimating the range of benefits, can require as much research time and expertise as determining whether the program had any impact. However, research dollars are always limited and evaluating program impact is usually the top priority, since valuing benefits is irrelevant if there is no program impact. A number of studies of the value of preventing negative outcomes among children and youth have been initiated recently. These can be expected to give program evaluators substantial help in estimating the value of reductions in youth problems for use in cost-benefit studies in the future.

Example. An evaluation of 13 delinquency prevention programs in Los Angeles County estimated cost-effectiveness as a function of the delinquency risk of the population of youth served, costs, and success rate. This study compared cost-to-benefit ratios of alternative programs designed with a common goal and outcome measure: preventing subsequent arrest. Because the risks of delinquency varied among the youth served by different programs, estimates of the risk of delinquency were derived from existing research and used to classify the youth served by the programs into four risk categories. Program costs were estimated by dividing the total budgets from all sources by the number of clients. Costs of public expenditures for delinquency (costs to the community and justice system) were estimated from the proportion of the justice system budget (from the County budget) devoted to juvenile cases, divided by the number of juvenile cases at various stages of processing (from annual reports of the Los Angeles Probation Department, the California Youth Authority, and the U.S. Department of Justice).

The public costs averted were calculated by dividing this budget by the number of arrests of youth following program participation and calculating the savings as the difference between the two. The benefits of reductions in expected future arrests were estimated as the probability of subsequent arrests reported in studies of criminal careers times the estimated public savings per arrest averted. Savings to victims were based on estimates of the costs of damage and loss for each type of juvenile offense from earlier research, adjusted for inflation. These costs per offense were applied to the expected lifetime arrests in the absence of the program (estimating that for each arrest there are four to five offenses that do not result in arrest), and benefits were estimated as the difference between these costs and the costs remaining when no further arrests or victimization occur. Thus, estimated program benefits were the sum of the public costs averted and the savings to victims.

The results were used to estimate the cost differential (costs divided by the value of benefits) for programs with different rates of success (measured as arrests prevented), controlling for the risk of offending of the juvenile population served. The findings were used to estimate the success rate required to show a positive rate of return, given the delinquency risk of the population served, for programs with different cost differentials. This estimate can be used in monitoring the performance of a wide variety of delinquency prevention programs.
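The break-even logic can be sketched as follows; every figure is hypothetical and is not drawn from the Los Angeles study, which reported its own estimates:

    # Hypothetical figures for illustration only; not the Los Angeles estimates.
    program_cost_per_youth = 3_000.00
    public_savings_per_arrest_averted = 8_000.00
    victim_savings_per_arrest_averted = 12_000.00    # reflects unarrested offenses as well
    expected_lifetime_arrests_without_program = 2.5  # varies with the risk category served

    benefit_per_prevented_offender = (
        (public_savings_per_arrest_averted + victim_savings_per_arrest_averted)
        * expected_lifetime_arrests_without_program
    )  # $50,000

    # Share of participants whose future arrests must be prevented for
    # benefits to equal program costs.
    break_even_success_rate = program_cost_per_youth / benefit_per_prevented_offender  # 6.0%

    print(f"Benefit per prevented offender: ${benefit_per_prevented_offender:,.0f}")
    print(f"Break-even success rate:        {break_even_success_rate:.1%}")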

Identifying Potential Evaluation Problems
A number of challenging problems face those who would apply research methods to the evaluation of human services programs. We summarize these, based on experience in reviewing and evaluating programs for vulnerable children and youth, to guide development of realistic evaluation plans.(21)
Defining Program Participation. Programs may be open-ended, lacking both formal intake procedures and policies for determining when the program is "completed." An evaluation can only yield interpretable results if participation is explicitly defined and uniformly measured. In the case of programs for vulnerable youth, for example, counselors may be contacted for several chats, followed some weeks later by an appointment, followed by intermittent participation in some, possibly not all, services offered. Youth may stop attending and then resume. Limiting participation in the evaluation to those who attend regularly is not an appropriate solution because dropping from consideration the youth who are most difficult to engage produces biased results. Often identifying who "participated" and for how long requires multiple categories to adequately reflect the variations in type, duration, and intensity of participation among the youth served. In addition, participants should be followed from the point of first contact and all major program activity documented. Evaluators also need to decide whether others who potentially benefit from the program-such as parents, boyfriends/girlfriends, or siblings-are defined as program participants. If so, their participation in program activities should also be tracked. If not, plans need to be made on how to count the gains made by these indirect program beneficiaries in evaluating program impact.

Evaluating the Relationship between Participation and Outcomes. Many programs emphasize individualized services tailored to need. In the youth services area, youth with the highest levels of risk are offered the greatest number or most intensive level of services. Obviously, assignment to treatment in this case is not random, and the multi-problem youth may never achieve the same level of positive outcomes as youth who began with fewer problems. For example, studies of School-Based Health Centers in the U.S. show that frequent clinic users were at greater risk for alcohol and substance use, sexual activity, and poor family and peer relationships.(22) Thus, comparing their outcomes to those for nonusers or those who used the clinic less frequently would be inappropriate. Similarly, comparisons between different programs must consider any differences in type and level of risk exhibited by participants. For this reason, data on the risks and needs of participants should be collected at intake for use in analysis and a pre-post design used when possible.

Defining the Unit of Analysis. Deciding on the appropriate unit of analysis can be difficult, particularly in evaluating comprehensive programs. Programs may target entire neighborhoods, classrooms, or families for change-sometimes planning activities directly for different groups, and sometimes planning carryover effects. Measurement at multiple levels is appropriate as long as each level is clearly defined. For example, crime reduction can be assessed by comparing neighborhood rates of calls for police services, household victimization rates, or youth delinquency surveys. Economic gains can be measured by changes in the area unemployment rate, average household or family income, or individual earnings. The selection should be closely linked to program goals and activities.

Evaluations of services integration programs, including most that use a case management approach, will face additional challenges in: 1) tracking the services received by participants; 2) developing common agreements among agencies on program goals and required components; 3) documenting service delivery by multiple agencies; 4) measuring effects of the service delivery system; and 5) differentiating services integration from service comprehensiveness. Each is discussed briefly below.

Tracking the services received by participants. Services integration usually involves referring participants to other agencies for needed assistance. A critical, and often difficult, problem is determining which services were actually received. Clients may or may not contact agencies to which they are referred, may or may not be accepted for services, and may or may not participate in services, if accepted. Documenting the chain of participation is essential to determine the extent to which services integration is being achieved, but is time consuming and often resisted by programs that see making the referral as the extent of their responsibility. Because staff turnover in service agencies is frequently high, preparing written agreements on data access and sharing is strongly recommended. In the absence of adequate agency documentation, information on service utilization can be collected in follow-up interviews with clients.

Developing common agreements among agencies on program goals and required components. The agencies collaborating in a services integration effort may differ in their vision of the program's goals, key strategies, and how youth needs will be evaluated and problems addressed. Evaluations tend to highlight these differences, which can constitute a barrier to gaining consensus on what is being evaluated. This is particularly true when multiple agencies recruit clients and/or case management services are not centralized. Time should be allocated for face-to-face meetings to reach agreement on whom the evaluation will count as participants, which measures of program outcomes will be used, and how service provision is expected to achieve program goals.

Documenting service delivery by multiple agencies. When many agencies coordinate and combine their resources to meet the needs of clients, one of the most difficult problems is assembling information on who received what types and amounts of service. Agencies have different methods of identifying clients. In the area of vulnerable children and youth, some use family identification numbers, while others identify individual children served. Some group service records by family or child; others maintain records by contact, which introduces multiple records for single clients that then have to be checked to remove duplication. Agencies such as schools or juvenile courts can face legal or professional barriers to sharing client-based information with other agencies or evaluators. A systematic process for collecting the data needed to compile a complete picture of program participation must be developed early in the planning process and, as noted above, supported by written agreements and ongoing technical assistance and staff training in record-keeping procedures.

Measuring effects of the service delivery system. A primary goal of services integration is to change agency operations and increase effectiveness. These outcomes need to be measured at the agency, not individual, level. Evaluations of services integration need to document changes in agency procedures, increased participation in collaborative planning and service delivery, and decreases in barriers to interagency cooperation and client service associated with policies and procedures. Referral patterns should show more diversity in planning. At the individual level, clients should report fewer unmet service needs, shorter waiting periods for service, and increased satisfaction with the response to their needs. Other evidence of integration includes increased staff knowledge of and familiarity with the resources of other agencies and community groups.

Differentiating services integration from service comprehensiveness. Services integration is intended to provide not only faster, more appropriate services, but also services that would not otherwise be available to certain clients. The referral process educates clients on the options and assistance potentially available. Improved interagency planning and coordination reduces the barriers to obtaining additional services. All this makes the task of differentiating services integration from service comprehensiveness very difficult. Evaluation and program staff need to develop clear expectations on the extent to which the ease of obtaining services and the appropriateness of the service package can be distinguished from the extent to which the program is providing comprehensive services to meet the full range of client needs.

Conclusions
Strong pressure to demonstrate program impacts dictates making evaluation activities a required and intrinsic part of program activities from the start. At the very least, evaluation activities should include performance monitoring. The collection and analysis of data on program progress and process builds the capacity for self-evaluation and contributes to good program management and efforts to obtain support for program continuation-for example, when the funding is serving as "seed" money for a program that is intended, if successful, to continue under local sponsorship. Performance monitoring can be extended to non-experimental evaluation with additional analysis of program records and/or client surveys. These evaluation activities may be conducted either by program staff with research training or by an independent evaluator. In either case, training and technical assistance to support program evaluation efforts will be needed to maintain data quality and assist in appropriate analysis and use of the findings.

There are several strong arguments for evaluation designs that go further in documenting program impact. Only experimental or quasi-experimental designs provide convincing evidence that program funds are well invested and that the program is making a real difference to the well-being of the population served. These evaluations need to be conducted by experienced researchers and supported by adequate budgets. A good strategy may be to implement small-scale programs to test alternative models of service delivery in settings that allow a stronger impact evaluation design than is possible in a large-scale national program. Often program evaluation should proceed in stages. The first year of program operations can be devoted to process studies and performance monitoring, the information from which can serve as a basis for more extensive evaluation efforts once operations are running smoothly.

Finally, planning to obtain support for the evaluation at every level (community, program staff, agency leadership, and funder) should be extensive. Each of these has a stake in the results. Each should have a voice in planning. And each should perceive clear benefits from the results. Only in this way will the results be acknowledged as valid and actually used for program improvement.

 

Notes

1. Connell, J.P., Kubisch, A.C., Schorr, L.B., and Weiss, C.H. (1995) New Approaches to Evaluating Community Initiatives: Concepts, Methods, and Contexts. Washington, DC: The Aspen Institute.

2. Kumpfer, K.L., Shur, G.H., Ross, J.H., Bunnell, K.K., Librett, J.J. and Milward, A.R. (1993) Measurements in Prevention: A Manual on Selecting and Using Instruments to Evaluate Prevention Programs. Public Health Service, U.S. Department of Health and Human Services, (SMA) 93-2041.

3. For more information on deciding whether and how to conduct a program evaluation, see Schmidt, R.E., Bell, J.B., and Scanlon, J.W. (1979), "Evaluability Assessment: Making Public Programs Work Better," Human Services Monograph Series, 14: 4-5. Washington, DC; and Wholey, Joseph S. (1994), "Assessing the Feasibility and Likely Usefulness of Evaluation." In Joseph S. Wholey, Harry P. Hatry, and Katherine E. Newcomer (eds.), Handbook of Practical Evaluation, 15-39. San Francisco: Jossey-Bass.

4. Berk, R.A., and Sherman, L.W. (1988) "Police Responses to Family Violence Incidents: An Analysis of an Experimental Design with Incomplete Randomization." Journal of the American Statistical Association 83(401):70-76.

5. Kalbfleisch, J.D., and Prentice, R.L. (1980) The Statistical Analysis of Failure Time Data. New York: Wiley.

6. Rhodes, W.M. (1986) "A Survival Model with Dependent Competing Events and Right-hand Censoring: Probation and Parole as an Illustration." Journal of Quantitative Criminology 2(2): 113-138.

7. Ellickson, P.L., Bell, R.M., and McGuigan, K. (1993) "Preventing Adolescent Drug Use: Long-Term Results of a Junior High School Program." American Journal of Public Health 83(6): 856-861.

8. See Campbell, D.T. and Stanley, J.C. (1963) Experimental and Quasi-experimental Designs for Research. Chicago: Rand McNally.

9. Campbell and Stanley (1963).

10. Loftin, C., McDowall, D., Wiersma, B., and Cottey, T.J. (1991) "Effects of Restrictive Licensing of Handguns on Homicide and Suicide in the District of Columbia." New England Journal of Medicine 325 (December 5): 1615-1620.

11. Heckman, J.J. (1979) "Sample Selection Bias as a Specification Error." Econometrica 47:153-162.

12. Joreskog, K.G. (1977) "Structural Equation Models in the Social Sciences." In P.R. Krishnaiah (ed.), Applications of Statistics, 265-287. Amsterdam: North-Holland; Bryk, A.S. and Raudenbush, S.W. (1992) Hierarchical Linear Models: Applications and Data Analysis Methods. Newbury Park, CA: Sage.

13. Roosa, M.W. and Vaughan, L. (1983) "Teen Mothers Enrolled in an Alternative Parenting Program: A Comparison with Their Peers." Urban Education 18: 348-360.

14. Engle, R.F. and Granger, C.W.J. (1987) "Cointegration and Error Correction: Representation, Estimation and Testing." Econometrica 55: 251-276.

15. Barker, G. and Fontes, M. (1995) "Review and Analysis of International Experience with Programs Targeted at At-Risk Youth." Paper prepared for the World Bank.

16. Barker and Fontes (1995).

17. Chaiken, M. (1990) "Evaluation of Girls Clubs of America's Friendly PEERsuasion Program." In R.R. Watson (ed.), Drug and Alcohol Abuse Prevention, 265-287. Clifton, NJ: Humana Press.

18. Rossman, S.B. and Morley, E. (1994) The National Evaluation of Cities in Schools. Report submitted to the Office of Juvenile Justice and Delinquency Prevention. Washington, DC: The Urban Institute.

19. Cohen, M. (1994) "The Monetary Value of Saving a High Risk Youth." Draft report. Washington, DC: The Urban Institute.

20. Kim, S., Coletti, S.D., Crutchfield, C.C., Williams, C. and Hepler, N. (1995) "Benefit-Cost Analysis of Drug Abuse Prevention Programs: A Macroscopic Approach." Journal of Drug Education 25(2): 111-127.

21. Burt, M.R. and Resnick, G. (1992) Youth at Risk: Evaluation Issues. Washington, DC: The Urban Institute.

22. Barker and Fontes (1995).