The next two subsections, "ROC Analysis" and "Dichotomization Analysis," show exactly how these two accuracy approaches can be applied to the Hershkowitz et al. (2007) data. The discussion in these two subsections is unavoidably technical; readers who are not interested in the technical details can skip ahead to the "Summary" subsection without any loss of continuity or understanding.
ROC Analysis
In Hershkowitz et al. (2007), study participants judged the likely validity of children's transcribed reports of sexual abuse using a five-point ordinal scale—very unlikely (VUL), quite unlikely (QUL), no judgment possible (NJP), quite likely (QL), and very likely (VL). However, no participant made any VUL judgments, so there were only four observed levels of the judgment variable, which means that it is possible to specify three nontrivial decision thresholds or cut points for classifying participant judgments as positive ("substantiation") or negative ("unsubstantiation"): (1) NJP, QL, and VL = substantiation, QUL = unsubstantiation; (2) QL and VL = substantiation, QUL and NJP = unsubstantiation; and (3) VL = substantiation, QUL, NJP, and QL = unsubstantiation.
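The three cut points can be made concrete with a short sketch. The category abbreviations are the ones used above; the function name `dichotomize` and the integer scenario numbering are illustrative choices, not part of the original study.

```python
# Ordinal judgment scale, ordered from least to most likely to be valid.
SCALE = ["VUL", "QUL", "NJP", "QL", "VL"]

# Lowest category that counts as "substantiation" under each cut point.
# The keys mirror cut points (1)-(3) in the text; the numbering is illustrative.
CUT_POINTS = {1: "NJP", 2: "QL", 3: "VL"}


def dichotomize(judgment: str, scenario: int) -> str:
    """Classify one ordinal judgment as positive or negative for a given cut point."""
    rank = SCALE.index(judgment)
    threshold = SCALE.index(CUT_POINTS[scenario])
    return "substantiation" if rank >= threshold else "unsubstantiation"


if __name__ == "__main__":
    for scenario in CUT_POINTS:
        print(scenario, {j: dichotomize(j, scenario) for j in SCALE})
```

Because each cut point only moves the boundary along the same ordered scale, a judgment that counts as substantiation under a stricter cut point also counts as substantiation under every more lenient one.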
These three thresholds can be used to draw ROC curves (Figure 1) and calculate the AUCs for non-protocol interviews (AUC = .54; 95% confidence interval [CI]: .41-.66) and protocol interviews (AUC = .81; 95% CI: .72-.91). If non-protocol and protocol interview judgments are combined into one set of 168 judgments, the AUC is .69 (95% CI: .61-.77). The AUC for non-protocol interviews does not differ significantly from the expected chance AUC of .50 (p = .58); the protocol interview AUC differs significantly from both chance (p < .001) and from the non-protocol interviews (p < .001). Confidence intervals and p-values for the AUCs were calculated using SPSS 15 and Stata 9 without making parametric assumptions about the shapes of the ROC curves.
Figure 1
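As a rough check, the reported AUCs can be approximated from the three operating points alone. The sketch below assumes that the nonparametric AUC corresponds to the trapezoidal area under the ROC polyline, and it uses the rounded FPR and FNR values from Table 2 as inputs, so small rounding differences are to be expected. The function name `empirical_auc` is hypothetical.

```python
# (FPR, FNR) for Scenarios 1, 2, and 3, copied from the rounded values in Table 2.
RATES = {
    "non-protocol": [(0.88, 0.05), (0.40, 0.62), (0.05, 0.83)],
    "protocol": [(0.76, 0.00), (0.48, 0.05), (0.19, 0.33)],
    "combined": [(0.82, 0.02), (0.44, 0.33), (0.12, 0.58)],
}


def empirical_auc(rates):
    """Trapezoidal area under the ROC polyline through (0, 0), the cut points, and (1, 1)."""
    # Convert each (FPR, FNR) pair to an ROC point (FPR, TPR), where TPR = 1 - FNR.
    points = sorted((fpr, 1.0 - fnr) for fpr, fnr in rates)
    curve = [(0.0, 0.0)] + points + [(1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    for name, rates in RATES.items():
        print(name, round(empirical_auc(rates), 2))
```

Under these assumptions the areas agree with the AUCs reported above to two decimal places. The sketch does not reproduce the confidence intervals or p-values, which require the underlying judgment counts.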
Swets, Dawes, and Monahan (2000) discuss applications of ROC accuracy analysis and provide examples of empirically estimated AUC values for a variety of diagnostic tests. For example, one actuarial instrument for predicting violence risk had an estimated AUC of .76, and some tests that detect the presence of HIV had AUCs greater than .90. Rice and Harris (2005) assert that an AUC of .71 or higher indicates a large effect (roughly equivalent to a correlation of r = .50 or higher when the population base rate for a condition is .50). In short, a ROC analysis of the Hershkowitz et al. (2007) data indicates that study participants showed no ability to discriminate between true and false reports of sexual abuse made in non-protocol interviews, whereas judgments about the NICHD protocol interviews showed high accuracy, according to the criteria endorsed by Rice and Harris. The overall AUC across all 168 judgments, .69, is in the moderate accuracy range, according to Rice and Harris' criteria. However, the conclusion that either the protocol interview judgments or all judgments combined were of moderate or high accuracy is problematic, as we shall see.
Dichotomization Analysis
There are many different systems in use by forensic evaluators for classifying cases of alleged sexual abuse. In most CPS investigations in the USA, cases are classified as either substantiated or unsubstantiated, although some states allow for a three-way classification: substantiated, inconclusive, or unfounded (U.S. Department of Health and Human Services, 2007). In other forensic settings, there may be four or five probability-related classification categories available to evaluators (e.g., Dubowitz, Black, & Harrington, 1992). In general, evaluators use either one or two moderate to high probability categories that correspond well to substantiation. Most of the remaining categories are used for cases that evaluators judge to be either inconclusive or unfounded, in other words, the unsubstantiated cases. Collapsing classification categories to two, substantiated and unsubstantiated, is necessary in order to calculate accuracy statistics that depend on a dichotomous classification of judgments into hits or misses.
In the Hershkowitz et al. (2007) study, judgments of very likely (VL) or quite likely (QL) correspond well to the substantiated category, and judgments of very unlikely (VUL) or quite unlikely (QUL) correspond well to the unfounded category. No judgment possible (NJP) judgments, which comprised about one third of all judgments, appear to correspond most closely to the inconclusive category, a subset (along with the unfounded category) of the unsubstantiated category. The decision threshold for substantiation that corresponds best to the real-world dichotomy between substantiation and unsubstantiation is the one that places the cut point for substantiation between NJP and QL (VUL, QUL, and NJP = unsubstantiated; QL and VL = substantiated).
Table 2 shows accuracy statistics for each of the three nontrivial decision thresholds used in the ROC analysis. Scenario 2 places the cut point for substantiation between NJP and QL and therefore corresponds most closely to real-world classification models. For each of the three scenarios, Table 2 shows statistics for the 84 non-protocol interview judgments, the 84 protocol interview judgments, and for all 168 judgments combined. These statistics are described in the note appearing below Table 2. The false positive rate (FPR), false negative rate (FNR), and the likelihood ratio (LR) are all independent of the base rate for true allegations in the population. The other statistics shown in Table 2 vary depending on the population base rate for true allegations.
Table 2
Three Alternative Methods for Dichotomizing Evaluator Judgments in Hershkowitz et al. (2007)

                                       Base rate = .25           Base rate = .50           Base rate = .75
                 FPR   FNR    LR      FP    FN    HR     f      FP    FN    HR     f      FP    FN    HR     f

Scenario 1. Classify as substantiation if the judgment is no judgment possible, quite likely, or very likely
Non-protocol     .88   .05   1.1     .66   .01   .33   .10     .44   .02   .54   .13     .22   .04   .74   .13
Protocol         .76   .00   1.3     .57   .00   .43   .27     .38   .00   .62   .37     .19   .00   .81   .44
Combined         .82   .02   1.2     .62   .01   .38   .19     .41   .01   .58   .26     .21   .02   .78   .28

Scenario 2. Classify as substantiation if the judgment is quite likely or very likely
Non-protocol     .40   .62    .9     .30   .15   .54  -.02     .20   .31   .49  -.02     .10   .46   .43  -.02
Protocol         .48   .05   2.0     .36   .01   .63   .42     .24   .02   .74   .53     .12   .04   .85   .55
Combined         .44   .33   1.5     .33   .08   .59   .20     .22   .17   .61   .23     .11   .25   .64   .20

Scenario 3. Classify as substantiation if the judgment is very likely
Non-protocol     .05   .83   3.5     .04   .21   .76   .19     .02   .42   .56   .19     .01   .63   .36   .15
Protocol         .19   .33   3.5     .14   .08   .77   .45     .10   .17   .74   .48     .05   .25   .70   .41
Combined         .12   .58   3.5     .09   .15   .76   .33     .06   .29   .65   .34     .03   .44   .53   .27

Note.
The exact real-world base rate for true allegations of sexual abuse is unknown and cannot be directly estimated (cf. Horowitz, Lamb, Esplin, Boychuk, & Reiter-Laverly, 1995). Furthermore, base rates are almost certainly not constant across different subgroups (for example, cases involving custody disputes vs. cases with no custody dispute). One rough estimator of the base rate for true allegations is the observed substantiation rate. The weighted mean base rate for substantiation in the subset of cases that include an interview report of sexual abuse, that is, the probability of substantiation given interview report or p(S | R), across the five studies shown in Table 1 and three additional studies that report this specific information (Keary & Fitzpatrick, 1994; Levy, Markovic, Kalinowski, & Ahart, 1995; Stroud, Martens, & Barker, 2000) is .79. However, given the strong false positive bias observed in Hershkowitz et al. (2007), .79 may be an overestimate of the actual base rate for true allegations in this subset. In other subsets, such as that consisting of uncorroborated allegations of sexual abuse that arise in the context of custody disputes (cf. Horner & Guyer, 1991a, 1991b), the base rate for true reports of sexual abuse by children may be less than .50. For comparative purposes, Table 2 shows base rate sensitive statistics for three hypothetical base rates, .25, .50 (the experimenter-set base rate for true allegations in Hershkowitz et al., 2007), and .75.
The hit rates shown in Table 2 can be compared to the base rate judgment hit rate, the hit rate that would be obtained if all judgments were the same, either all substantiated or all unsubstantiated. Thus, for population base rates .25, .50, and .75, the base rate judgment hit rates would be .75 (classify all cases as unsubstantiated), .50 (classify all cases as either unsubstantiated or substantiated), and .75 (classify all cases as substantiated), respectively.
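The base rate judgment hit rate is simply the larger of the base rate and its complement. A minimal sketch follows; the function names are hypothetical.

```python
def base_rate_judgment_hit_rate(base_rate: float) -> float:
    """Hit rate obtained by giving every case the more common classification."""
    return max(base_rate, 1.0 - base_rate)


def margin_over_base_rate(hit_rate: float, base_rate: float) -> float:
    """Observed hit rate minus the base rate judgment hit rate (negative = worse)."""
    return hit_rate - base_rate_judgment_hit_rate(base_rate)


if __name__ == "__main__":
    for base_rate in (0.25, 0.50, 0.75):
        print(base_rate, base_rate_judgment_hit_rate(base_rate))
```

A negative margin means that evaluators would have been correct more often by giving every case the same classification.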
Considering only the most realistic and generalizable scenario in Table 2, Scenario 2, it is clear that the accuracy of dichotomous judgments about non-protocol interviews is below or near chance levels for all three hypothetical base rates. This holds whether accuracy is measured by the hit rate (HR, range = .43-.54) or by the correlation between judgments (substantiated or unsubstantiated) and the real state of the world (allegation true or allegation false; f = -.02 for all three base rates). The FPR is .40; the FNR is .62; and the LR is .9. The LR here is the ratio of the true positive rate to the false positive rate, (1 - FNR) / FPR. LRs can range from 0 to infinity. LRs between 0 and 1.0 indicate that substantiation provides evidence that the allegation is false. An LR of 1.0 means that there is no association between substantiation and the validity of the allegation (equivalent to a correlation of 0). Evidence with an LR between 1.0 and 3.0 provides only weak support for a hypothesis (Goodman & Royall, 1988; Wood, 1996). In summary, the statistics for dichotomized judgments about the non-protocol interviews are consistent with the ROC analysis in that they confirm that the study participants demonstrated no ability to distinguish between true and false reports made in non-protocol interviews. What does Table 2 reveal about the accuracy of judgments for protocol interviews?
Again considering only Scenario 2, for protocol interviews the FPR is .48; the FNR is .05; and the LR is 2.0, in the weak range. Whether or not the hit rate exceeds the base rate judgment hit rate depends on the population base rate for true allegations. For a base rate of .25, the evaluator hit rate, .63, falls .12 points below the base rate judgment hit rate. For base rates of .50 and .75, the hit rates, .74 and .85, exceed the base rate judgment hit rates by .24 and .10, respectively. The estimated proportion of all errors that are false positives, FP / (FP+FN), ranges from .75 to .97. The correlations between judgments and real-world allegation status range from .42 to .55.
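The base-rate-sensitive statistics can be derived from the FPR, the FNR, and a hypothetical base rate, as sketched below. The sketch assumes that the correlation labelled f in Table 2 is the phi coefficient of the resulting 2 x 2 table and that the LR is (1 - FNR) / FPR; neither formula is stated in the text above, and the function name `dichotomous_accuracy` is hypothetical. The inputs are the rounded rates from Table 2, so the outputs match the tabled values only to within rounding.

```python
import math


def dichotomous_accuracy(fpr: float, fnr: float, base_rate: float) -> dict:
    """Derive base-rate-dependent accuracy statistics from FPR, FNR, and a base rate.

    FP, FN, and HR are expressed as proportions of all cases.
    """
    tp = base_rate * (1.0 - fnr)          # true allegations that are substantiated
    fn = base_rate * fnr                  # true allegations that are not substantiated
    fp = (1.0 - base_rate) * fpr          # false allegations that are substantiated
    tn = (1.0 - base_rate) * (1.0 - fpr)  # false allegations that are not substantiated

    # Phi coefficient of the 2 x 2 table (assumed to be the statistic labelled f).
    denom = math.sqrt((tp + fp) * (fn + tn) * (tp + fn) * (fp + tn))
    phi = (tp * tn - fp * fn) / denom if denom > 0 else float("nan")

    return {
        "FP": fp,
        "FN": fn,
        "HR": tp + tn,
        "f": phi,
        "LR": (1.0 - fnr) / fpr if fpr > 0 else float("inf"),
        "FP_share_of_errors": fp / (fp + fn) if (fp + fn) > 0 else float("nan"),
    }


if __name__ == "__main__":
    # Scenario 2, protocol interviews: FPR = .48, FNR = .05 (rounded, from Table 2).
    for base_rate in (0.25, 0.50, 0.75):
        stats = dichotomous_accuracy(fpr=0.48, fnr=0.05, base_rate=base_rate)
        print(base_rate, {name: round(value, 2) for name, value in stats.items()})
```

Only the base rate changes across the three calls, which makes visible why FPR, FNR, and LR stay fixed while FP, FN, HR, and f shift with the assumed proportion of true allegations.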
Summary
The ROC analysis of the Hershkowitz et al. (2007) data indicates that (a) study participants showed no ability to discriminate between true and false reports made by children during the course of non-protocol interviews and (b) study participants were able to discriminate between true and false reports made during NICHD protocol interviews with fairly high levels of accuracy, as compared to the accuracy of clinical judgments made in other domains of psychology and medicine. However, the dichotomization analysis showed that the accuracy of NICHD protocol interview judgments is probably too low to serve as the basis for making potentially life-altering legal decisions. Of particular concern is the high false positive rate: in 48% of the evaluations of protocol interview transcripts in which the child's report was actually false, the study participants judged the (false) reports to be very likely or quite likely to be true.
The dichotomous outcome accuracy analysis for NICHD protocol interviews appears inconsistent with the ROC accuracy analysis, which indicated a fairly high level of accuracy, according to the criteria proposed by Rice and Harris (2005). However, ROC and other nondichotomous analyses can often lead to "dramatically different conclusions concerning the value of predictive devices" than analyses that take base rates and the relative costs of different types of hits and misses into account (Gottfredson & Moriarty, 2006, p. 189). Mossman (2000) provides a cogent explanation of how a diagnostic test may appear to be quite accurate in a ROC analysis, yet not be accurate enough to serve as the basis for making real-world clinical decisions. One general problem is that test validity and accuracy are relative rather than absolute concepts, and the question of whether the accuracy or usefulness of a particular test is high or low is not a strictly statistical one, but depends on (a) the benefits of different types of hits (true positives and true negatives), (b) the costs of different types of misses (false positives and false negatives), (c) the decision threshold or cut point used, and (d) the base rate for the condition that the test is designed to detect.
A specified level of accuracy that might be considered acceptable or high for one purpose might be considered unacceptable or low for another purpose. For example, a false positive rate that would be considered unacceptably high in one context (e.g., to decide whether or not a citizen's parental rights should be terminated because of child maltreatment) could be considered acceptable or even low in another context (e.g., for the purpose of screening large numbers of children for possible maltreatment, because a false positive result would lead only to further investigation, rather than to a final dispositional decision). In the current case, the most relevant accuracy statistics are those that depend on the dichotomization of judgments, because they are most closely tied to what happens in real forensic evaluations.
The finding of unacceptably low overall accuracy and high false positive rates for dichotomized judgments about the validity of allegations of CSA is consistent with earlier analyses, which have concluded that (a) error rates in CSA evaluations are high (e.g., Herman, 2005) and (b) false positive rates are probably higher than false negative rates (Horner & Guyer, 1991a, 1991b). The conclusion that the accuracy of judgments about the validity of children's reports of sexual abuse is low is also consistent with results from a large body of empirical research on the human detection of truth and deception. In a meta-analysis of 206 deception detection studies, Bond and DePaulo (2006) found that the average rate of correct classification was only 54% (47% for false messages; 61% for true messages), only slightly higher than the expected chance accuracy rate of 50%. In the 19 studies in that meta-analysis that compared persons who are presumed to be experts in deception detection (police officers, detectives, judges, interrogators, criminals, customs officials, mental health professionals, polygraph examiners, job interviewers, federal agents, and auditors) to lay people, the experts had slightly lower (but not statistically significantly lower) average accuracy than the lay people: lay people were correct 55.74% of the time and the experts 54.09% (p. 229).
Clinical Judgment vs. Criteria-Based Content Analysis (CBCA)
It is instructive to compare the results of Hershkowitz et al. (2007) to those of an earlier study conducted by Lamb, Sternberg, Esplin, Hershkowitz, Orbach, and Hovav (1997). Two of the authors of Hershkowitz et al. (2007), Hershkowitz and Lamb, participated in the 1997 study. The 1997 study tested the accuracy of CBCA using cases and a methodology that were similar to those used in the 2007 study, except that the validity of the children's transcribed statements was assessed through the application of the actuarial CBCA procedure, which produces a numerical score. Higher CBCA scores are supposed to indicate higher probability of veracity.
In Lamb et al. (1997), cases were classified by the researchers into five evidence-level categories based on the strength of the independent corroborating evidence. Lamb et al. did not evaluate the accuracy of the CBCA method in terms of a hit rate, but rather in terms of the correlation between the numerical CBCA scores and the five evidence-level categories, which were very unlikely (= 1), quite unlikely (= 2), questionable (= 3), quite likely (= 4), and very likely (= 5). The correlation between the level of evidence and the CBCA scores was r = .35. Lamb et al. conclude that "the results reported here … underscore that CBCA scores should not yet—and perhaps should never—be used in forensic contexts to evaluate individual statements" (pp. 262-263). Hershkowitz et al. (2007), referring to the 1997 study by Lamb et al., make a similar statement, "erroneous judgments [based on CBCA scores] were too frequent to make forensic application appropriate" (p. 101). The inability of CBCA scores to discriminate between true and false allegations was largely due to a high false positive rate. The same factors that apparently lead to a high false positive rate for CBCA may also be partly responsible for the high false positive rate in evaluators' judgments in Hershkowitz et al. (2007): "[false reports] may contain many [CBCA criteria indicative of true reports] when the central false element—an allegation of vaginal penetration, for example—is embedded in the description of an experienced event, such as a recent interaction with the alleged perpetrator" (Lamb et al., 1997, p. 262).
The accuracy of the judgments of the evaluators in Hershkowitz et al. (2007) can be analyzed using a method almost identical to the one employed in Lamb et al. (1997). That is, it is possible to use the data presented in Hershkowitz et al. to calculate a correlation coefficient reflecting the association between Hershkowitz et al.'s evidence-level classification of the cases (implausible = 0; plausible = 1) and the study participants' judgments of validity (very unlikely = 1, quite unlikely = 2, no judgment possible = 3, quite likely = 4, and very likely = 5). Ironically, the overall correlation is exactly r = .35 (r = .11 for non-protocol interviews; r = .58 for protocol interviews).
If a correlation of .35 is too small to allow for the use of CBCA to assess the validity of children's reports, then it is also, a fortiori, too small to allow for the use of informal clinical judgments to do the same thing. Although they are willing to assert the former, Hershkowitz et al. (2007) stop short of asserting the latter. The conclusion that CBCA is not accurate enough for use in forensic contexts is relatively innocuous, since it is unlikely to seem very controversial or to disturb anyone except the minority of evaluators who use CBCA. By contrast, the conclusion that clinical judgment is not accurate enough to be used in forensic contexts to evaluate the validity of children's reports is likely to be extremely controversial. Clinical judgment is used by almost all forensic evaluators in decisions about thousands of children's reports each year in the United States, Israel, England, and other countries. The conclusion therefore implies that thousands of well-intentioned professionals are using a diagnostic method with a very high false positive error rate to evaluate tens of thousands, perhaps even hundreds of thousands, of uncorroborated reports of sexual abuse each year.
…
Limitations
There are a number of possible objections that can be raised to the generalizability of the findings and the validity of the conclusions in Hershkowitz et al. (2007) and to the validity of the reanalysis and reinterpretation of Hershkowitz et al. presented here; some of these are addressed by Hershkowitz et al. in their original report, others are not. Issues addressed in the original report include these three:
First, accuracy may have been higher if study participants had been able to see or hear recordings of the interviews, rather than only having access to transcripts. However, as Hershkowitz et al. (2007, p. 107) note, significant improvements in accuracy are unlikely, given a number of empirical studies of deception detection that have found little or no accuracy advantage for judgments based on audio or audiovisual recordings vs. judgments based on written transcripts (Bond & DePaulo, 2006; Vrij, 2000).
Second, Hershkowitz et al. (2007, p. 107) argue that the excessively high rate of positive judgments (substantiations) in their study may have been due to the fact that "the vast majority of allegations made by children in forensic contexts are plausible" or, in other words, that study participants believed that the real-world base rate for "plausible" (true? believable?) allegations based on children's reports of sexual abuse was high and that this belief may have "affected [study participants'] willingness to identify too many of the transcripts as implausible" (p. 107). This explanation is not compelling given that (a) human judges often fail to take base rate information into account (Garb, 1998; Poole & Lamb, 1998; Tversky & Kahneman, 1974) and (b) even if the threshold for substantiation in Hershkowitz et al. is set very high (only very likely is counted as a substantiation, which corresponds to Scenario 3 in Table 2), the false positive rate is still quite high (.19 for protocol interviews).
Third, Hershkowitz et al. (2007, p. 107) suggest that, in real-world evaluations, no judgment possible (NJP) classifications would probably lead to further investigation, which might lead to more correct classifications. More generally, in a real investigation, even one in which corroborative evidence is lacking, evaluators would have access to information besides the child's interview statements. For example, they might have interviews with the alleged perpetrator and other parties involved in the case, data about the context and manner in which the concern about possible sexual abuse first arose, data regarding the number of people who had talked with the child about suspected abuse before the recorded interviews occurred, as well as other case history information. It is possible that access to this additional information could improve decision accuracy across all cases (not just NJP cases). It is also possible that interviewing other parties who are convinced that a false allegation is true or that a true allegation is false might serve to reinforce an evaluator's confidence in an erroneous judgment. Numerous studies have found that the accuracy of predictions or diagnoses based on clinical judgment often fails to improve with access to additional clinical information (see Garb, 1998, for a review).
There are at least three other issues that were not addressed in the original report. First, Hershkowitz et al. (2007) conducted their study in Israel. It is possible that (a) forensic evaluators in Israel are less skilled in judging the validity of CSA allegations than evaluators in other countries, (b) sexual abuse in Israel differs from sexual abuse in other countries in such a way as to make its diagnosis more difficult in Israel, or (c) evaluators in Israel are more likely to believe allegations of sexual abuse than evaluators in other countries. The first hypothesis seems far-fetched, especially given the previously cited information that suggests that Israeli youth investigators appear to be better trained and more closely supervised than many of their counterparts in other countries. The second hypothesis also appears doubtful, given the marked consistency of the results of this study with previous analyses based on data from studies conducted in the USA (Herman, 2005). The third hypothesis is worthy of further investigation, given the recently published results of a study comparing college students' appraisals of the validity of allegations of CSA described in written scenarios (Nachson et al., 2007). Nachson et al. found that Israeli college students were somewhat more likely to believe sexual abuse allegations than were college students in the United States, Canada, New Zealand, or the United Kingdom.
Second, although there was strong independent evidence that confirmed or refuted the allegations in the 24 cases used as stimuli in Hershkowitz et al. (2007), the study participants were not given access to this evidence. From the participants' perspective, these were cases of uncorroborated (and unrefuted) abuse allegations. Furthermore, children apparently made verbal statements of sexual abuse in all 24 of the transcribed interviews in these cases. Thus, to the extent that the study findings are generalizable to the real world, they would generalize most directly to cases in which (a) a child reports sexual abuse in a formal investigative interview and (b) the child's report is not confirmed or refuted by independent evidence. The data from the studies shown in Table 1 suggest that these cases constitute approximately 35% of all cases of alleged CSA, a substantial minority. These are also the cases about which there is likely to be the most disagreement among professionals and which are most likely to end up resulting in criminal trials, because cases with strong corroborating evidence are more likely to terminate in plea bargains and cases with no report of sexual abuse by a child and no evidence are unlikely to be pursued. Although the findings of Hershkowitz et al. (2007) generalize most directly to this subset of cases, they are also relevant to other types of cases, including cases with weak or questionable corroborative evidence, for example, cases in which there is ambiguous medical evidence that may be suggestive of abuse (Berenson et al., 2000; Goodyear-Smith & Laidlaw, 1998; Paradise, Winter, Finkel, Berenson, & Beiser, 1999; Pillai, 2005), and to cases that are substantiated without a clear, unrecanted statement of sexual abuse by the child during formal interviews—12% of all of the cases in the studies shown in Table 1.
Third, the cases used in Hershkowitz et al. (2007) may be unrepresentative of all cases of alleged sexual abuse because only cases in which there was independent evidence that the reports were either true or false were selected. For example, corroborative medical evidence would be more likely to be found in cases involving allegations of penetration as opposed to cases involving allegations of fondling. Detailed information on the types of allegations that appeared in the selected cases is not included in the original report.
Research Recommendations
Hershkowitz et al. (2007, pp. 103, 106) asked study participants how confident they were that their judgments were correct. As in other studies of judgments about truth and deception detection (DePaulo, Charlton, Cooper, Lindsay, & Muhlenbruck, 1997), many judges were unjustifiably confident that their judgments were correct—the average confidence level across the 168 judgments was 3.9 on a 5-point scale from very unconfident to very confident, but both accuracy and the correlation between accuracy and confidence were low. The overconfidence of some evaluators can compound the damage caused by their diagnostic errors, because legal decision makers may be influenced by evaluators' high levels of confidence.
It would be interesting to see if it might be possible to reduce evaluators' overconfidence in their opinions by asking them to judge the validity of allegations using selected transcripts from Hershkowitz et al. (2007), and then providing them with feedback about their accuracy. If such an intervention could mitigate the harm caused by the dangerous combination of overconfidence and high error rates, it might be worth incorporating confidence-reducing interventions into training programs for CSA evaluators. Creating and testing such an intervention would be a worthwhile research endeavor.
It would be interesting to compare the accuracy of the evaluators in Hershkowitz et al. (2007) to the accuracy of the CBCA procedure on the same set of transcripts in order to determine if CBCA, with an appropriate cut point, would result in significantly more or less accurate classifications than those made by the evaluators in the original study.
Replication of Hershkowitz et al. (2007) with different interview samples will be quite difficult and time consuming because of the scarcity of cases in which there is strong evidence that CSA allegations are false. Hershkowitz et al. (2007, p. 101) report that Lamb et al. (1997) identified only 13 such cases in a systematic review of 1,100 cases. This does not reflect the probable rate of false allegations, but the extreme difficulty of proving that CSA never occurred. There is a marked asymmetry in the types of evidence that can be used to prove that CSA never occurred vs. evidence that can prove that it did occur. For example, a denial by the alleged perpetrator is not evidence that abuse did not occur, whereas a confession is strong evidence that it did; the fact that no one witnessed even one incident of abuse is not evidence that abuse did not occur, whereas a credible witness of a single incident is strong evidence that it did; in most cases, lack of medical evidence is not evidence that abuse did not occur, whereas some types of medical evidence provide strong evidence that it did; and so on.
Although the design and methodology used by Hershkowitz et al. (2007) are excellent, any study can be improved, especially with the benefit of hindsight. Even though replication with a new interview sample seems unlikely in the near future, replications with the same sample of interview transcripts, possibly translated into other languages, would be feasible. When replications are undertaken, there are a number of changes that may lead to minor improvements in the generalizability of results: (a) evaluators could be given access to the original audio- or video-tapes of child interviews, although, as noted above, this is unlikely to significantly improve accuracy; (b) evaluators could be given access to additional psychosocial case information such as interviews with parents and other parties; (c) the study could be replicated in other countries and/or with different types of evaluators (psychologists, psychiatrists, social workers, other physicians, law enforcement personnel, judges, potential jurors, etc.); (d) evaluators could be asked to provide a numerical estimate of the probability of truth of the allegations as well as classifying the allegation into predefined probability categories; (e) the predefined judgment categories could match real-world classification options more closely, for example, substantiated, inconclusive or no judgment possible, and unfounded; and (f) some participants could be informed of the base rate for true allegations for the study interviews in order to see if base rate knowledge improves judgment accuracy.
References
Berenson, A. B., Chacko, M. R., Wiemann, C. M., Mishaw, C. O., Friedrich, W. N., & Grady, J. J. (2000). A case-control study of anatomic changes resulting from sexual abuse. American Journal of Obstetrics and Gynecology, 182(4), 820-831.
Bond, C. F., Jr., & DePaulo, B. M. (2006). Accuracy of deception judgments. Personality and Social Psychology Review, 10(3), 214-234.
Cross, T. P., Finkelhor, D., & Ormrod, R. (2005). Police involvement in child protective services investigations: Literature review and secondary data analysis. Child Maltreatment, 10(3), 224-244.
DePaulo, B. M., Charlton, K., Cooper, H., Lindsay, J. J., & Muhlenbruck, L. (1997). The accuracy-confidence correlation in the detection of deception. Personality and Social Psychology Review, 1(4), 346-357.
DeVoe, E. R., & Faller, K. C. (1999). The characteristics of disclosure among children who may have been sexually abused. Child Maltreatment, 4(3), 217-227.
DiPietro, E. K., Runyan, D. K., & Fredrickson, D. D. (1997). Predictors of disclosure during medical evaluation for suspected sexual abuse. Journal of Child Sexual Abuse, 6(1), 133-142.
Dubowitz, H., Black, M., & Harrington, D. (1992). The diagnosis of child sexual abuse. American Journal of Diseases of Children, 146(6), 688-693.
Elliott, D. M., & Briere, J. (1994). Forensic sexual abuse evaluations of older children: Disclosures and symptomatology. Behavioral Sciences and the Law, 12(3), 261-277.
Fisher, R. P., Brennan, K. H., & McCauley, M. R. (2002). The cognitive interview method to enhance eyewitness recall. In M. L. Eisen, J. A. Quas & G. S. Goodman (Eds.), Memory and suggestibility in the forensic interview (pp. 265-286). Mahwah, NJ: Lawrence Erlbaum Associates.
Garb, H. N. (1998). Studying the clinician: Judgment research and psychological assessment. Washington, DC: American Psychological Association.
Goodman, S. N., & Royall, R. (1988). Evidence and scientific research. American Journal of Public Health, 78(12), 1568-1574.
Goodyear-Smith, F. A., & Laidlaw, T. M. (1998). What is an 'intact' hymen? A critique of the literature. Medical Science and the Law, 38(4), 289-300.
Gordon, S., & Jaudes, P. K. (1996). Sexual abuse evaluations in the emergency department: Is the history reliable? Child Abuse and Neglect, 20(4), 315-322.
Gottfredson, S. D., & Moriarty, L. J. (2006). Statistical risk assessment: Old problems and new applications. Crime and Delinquency, 52(1), 178-200.
Horner, T. M., & Guyer, M. J. (1991a). Prediction, prevention, and clinical expertise in child custody cases in which allegations of child sexual abuse have been made: I. Predictable rates of diagnostic error in relation to various clinical decision making strategies. Family Law Quarterly, 25(2), 217-252.
Horner, T. M., & Guyer, M. J. (1991b). Prediction, prevention, and clinical expertise in child custody cases in which allegations of child sexual abuse have been made: II. Prevalence rates of child sexual abuse and the precision of "tests" constructed to diagnose it. Family Law Quarterly, 25(3), 381-409.
Horowitz, S. W., Lamb, M. E., Esplin, P. W., Boychuk, T., & Reiter-Lavery, L. (1995). Establishing the ground truth in studies of child sexual abuse. Expert Evidence, 4(2), 42-51.
Keary, K., & Fitzpatrick, C. (1994). Children's disclosure of sexual abuse during formal investigation. Child Abuse and Neglect, 18(7), 543-548.
Lamb, M. E., Orbach, Y., Hershkowitz, I., Esplin, P. W., & Horowitz, D. (2007). A structured forensic interview protocol improves the quality and informativeness of investigative interviews with children: A review of research using the NICHD Investigative Interview Protocol. Child Abuse and Neglect, 31(11-12), 1201-1231.
Lamb, M. E., Sternberg, K. J., Esplin, P. W., Hershkowitz, I., Orbach, Y., & Hovav, M. (1997). Criterion-based content analysis: A field validation study. Child Abuse and Neglect, 21(3), 255-264.
Lamb, M. E., Sternberg, K. J., Orbach, Y., Esplin, P. W., & Mitchell, S. (2002). Is ongoing feedback necessary to maintain the quality of investigative interviews with allegedly abused children? Applied Developmental Science, 6(1), 35-41.
Levy, H. B., Markovic, J., Kalinowski, M. N., & Ahart, S. (1995). Child sexual abuse interviews: The use of anatomic dolls and the reliability of information. Journal of Interpersonal Violence, 10(3), 334-353.
Mossman, D. (2000). Commentary: Assessing the risk of violence--Are "accurate" predictions useful? Journal of the American Academy of Psychiatry and the Law, 28(3), 272-281.
Nachson, I., Read, J. D., Seelau, S. M., Goodyear-Smith, F., Lobb, B., Davies, G., et al. (2007). Effects of prior knowledge and expert statement on belief in recovered memories: An international perspective. International Journal of Law and Psychiatry, 30(3), 224-236.
Paradise, J. E., Winter, M. R., Finkel, M. A., Berenson, A. B., & Beiser, A. S. (1999). Influence of the history on physicians' interpretations of girls' genital findings. Pediatrics, 103(5), 980-986.
Pillai, M. (2005). Forensic examination of suspected child victims of sexual abuse in the UK: A personal view. Journal of Clinical Forensic Medicine, 12(2), 57-63.
Poole, D. A., & Lamb, M. E. (1998). Investigative interviews of children: A guide for helping professionals. Washington, DC: American Psychological Association.
Rice, M. E., & Harris, G. T. (2005). Comparing effect sizes in follow-up studies: ROC Area, Cohen's d, and r. Law and Human Behavior, 29(5), 615-620.
Stevenson, K. M., Leung, P., & Cheung, K. M. (1992). Competency-based evaluation of interviewing skills in child sexual abuse cases. Social Work Research and Abstracts, 28(3), 11-16.
Stroud, D. D., Martens, S. L., & Barker, J. (2000). Criminal investigation of child sexual abuse: A comparison of cases referred to the prosecutor to those not referred. Child Abuse and Neglect, 24(5), 689-700.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124-1131.
U.S. Department of Health and Human Services. (2007). Child maltreatment 2005. Washington, DC: U.S. Government Printing Office.
Vrij, A. (2000). Detecting lies and deceit: The psychology of lying and the implications for professional practice. New York: John Wiley and Sons.
Walsh, B. (1993). The law enforcement response to child sexual abuse cases. Journal of Child Sexual Abuse, 2(3), 117-121.
Walters, S., Holmes, L., Bauer, G., & Vieth, V. (2003). Finding Words: Half a nation by 2010: Interviewing children and preparing for court. Alexandria, VA: National Center for Prosecution of Child Abuse.
Warren, A. R., & Marsil, D. F. (2002). Why children's suggestibility remains a serious concern. Law and Contemporary Problems, 65(1), 127-147.
Wood, J. M. (1996). Weighing evidence in sexual abuse evaluations: An introduction to Bayes' Theorem. Child Maltreatment, 1(1), 25-36.
Yuille, J. C., Hunter, R., Joffe, R., & Zaparniuk, J. (1993). Interviewing children in sexual abuse cases. In G. S. Goodman & B. L. Bottoms (Eds.), Child victims, child witnesses: Understanding and improving testimony (pp. 95-115). New York, NY: The Guilford Press.