Ken Richardson is a figure in the psychometric world who has made some pretty bold claims about g, heritability, and IQ testing as a whole. Out of his many (flawed) works, there is one in particular I wish to respond to. Richardson and Norgate (2015) argue that the correlation between IQ and job performance is, at best, greatly overestimated by the pro-IQ side (figures such as Ian Deary, Richard Haier, Arthur Jensen, and, most notably in the job performance debate, Frank Schmidt and John Hunter) and, at worst, entirely useless. Do their arguments hold up?
Validity of IQ Testing
They begin by arguing that IQ tests are an invalid measure of mental ability, an argument Richardson has become quite well known for through his books and his paper “What Do IQ Tests Test?”. Richardson’s arguments, however, hold very little merit. To begin, Richardson and Norgate note that there is no accepted theory of intelligence, and hence IQ tests are not built like many other forms of measurement, such as a breathalyzer (p. 154). They extend this into the argument that IQ tests cannot have construct validity because we measure mental ability indirectly – that is, by picking and choosing items which we believe to have a relationship with it and testing them out.
For example, Richardson and Norgate say,
“Typically, in constructing a test, cognitive problems or items thought to engage aspects of intelligence are devised for presentation to testees in trials. Those items on which differences in performance agree with differences in the criterion are put together to make up an intelligence test. There are many other technical aspects of test construction, but this remains the essential rationale. Thus, nearly all contemporary tests, such as the Stanford-Binet or the Woodcock-Johnson tests, rely on correlations of scores with those from other IQ or achievement tests as evidence of validity.”
This is true, though Richardson and Norgate seem to fall off track in the next paragraph, saying,
“However, the question of whether such procedure measures the fundamental cognitive ability (or g) assumed has continued to haunt the field. Measuring what we think is being measured is known as the construct validity of the test—something that cannot, by definition, be measured indirectly. Generally, a test is valid for measuring a function if (a) the function exists and is well characterized; and (b) variations in the function demonstrably cause variation in the measurement outcomes. Validation research should be directed at the latter, not merely at the relation between what are, in effect, assumed to be independent tests of that function.”
While IQ tests are largely built around whatever the designer assumes to be a measure of mental ability, the basis for many of these correlative predictors has been well established. Take reaction times as my primary example: they are considered a measure of mental speed, so a faster reaction time is assumed to predict greater mental ability. On its face, this is a guess, and the importance of the relationship between IQ tests and reaction times would remain an open question. With the development of neuroscience and new technology, however, we are able to better validate reaction time measures.
Richardson and Norgate caution against the use of reaction times for a couple of reasons, neither of which is particularly damaging. The first is that the correlations are typically small, which is true, though this depends on the measure being used. As I will discuss below, more complex versions of reaction time tasks are more predictive of IQ and, regardless, the correlations tend to be moderate at the very least. Second, they bring up the fact that reaction time tests are “subject to a variety of other factors such as anxiety, motivation, experience with equipment, and training or experience of various kinds such as video game playing”. This may explain some portion of the variance, but for the following reasons, it likely doesn’t explain all of it.
Most experienced readers are familiar with electroencephalogram (EEG) technology, a machine that reads brainwaves and gives scientists an idea of the electrical currents passing through the brain. Some other important details of EEG studies will be touched on later, but for now, it is worth noting their relationship to reaction times. A study by Surwillo (1963) found a very large, positive correlation between simple reaction times and the period of the EEG waveform. In other words, the faster a respondent reacted to a stimulus, the shorter the brain’s electrical cycles. This supports the theory behind reaction times.
The length of a reaction time correlates with the complexity of the task. One study had participants face a control console with a home button nearest the participant and eight buttons forming a semicircle around it. Occasionally, one of the outer buttons would light up, at which point the participant had to lift their hand off the home button and press the lit button. The first act, lifting the hand off the home button, reflected decision time, whereas actually pressing the button reflected movement time. The decision, clearly a process requiring more mental work, took about twice as long as the movement (see Deary, 2001).
Similarly, the more complex the form of reaction time, the greater the correlation with IQ scores. A study by Der and Deary (2017) looks at the relationship between different forms of reaction time and IQ scores across age. The two main forms are simple reaction time and choice reaction time – the former requires reacting to a single stimulus, the latter requires making a choice. The more complex form, choice reaction time, shares a much greater correlation with IQ, which implies that IQ tests are measuring processing power. The correlation reached -0.44 to -0.53, which is moderate to strong for the social sciences. Finally, I’ll quickly note that reaction times are a very reliable psychometric tool, having high test-retest correspondence (Kolten et al., 2013).
Richardson and Norgate go on to dismiss biological and neurological approaches to understanding g on the grounds that the results are inconsistent. This is an incredibly simplistic (even unacademic) way of viewing the literature on the neuroscience of intelligence. While many areas are still unsettled, many reviews do find consistencies. In reviewing the literature on EEG responses and IQ, Ian Deary and Peter Caryl (1997) found generally consistent results: the more intelligent someone was considered (by an IQ test), the faster the brain’s electrical response to a stimulus.
Richardson and Norgate let the brain size-IQ correlation go entirely unmentioned. While there is heterogeneity in the findings, this is primarily due to the different methods used to measure brain size. Earlier methods were more indirect (such as hat size and lead pellets), but newer methods, primarily MRI, have allowed greater replication and reliability of actual brain size estimates (see Jensen, 1998). Additionally, the better the measure of IQ, the greater the correlation with brain size (Gignac and Bates, 2017). This has allowed us to properly estimate the brain size-IQ correlation. According to a meta-analysis by McDaniel (2005), the correlation is about 0.33; according to a review by Rushton and Ankney (2009), it is slightly higher at 0.4. This is particularly important because brain size seems to be a measure of processing power. Haug (1987) found a correlation of r = 0.48 between the number of cortical neurons and brain size; Pakkenberg and Gundersen (1997) found a correlation of 0.56 between the number of neurons and brain size.
The mental efficiency theory of intelligence has a lot of supporting evidence. Neubauer and Bucik (1996) found speed of information processing to be significantly and strongly associated with IQ scores. Thatcher et al., (2015) found that IQ was related to greater neural efficiency as measured by EEG. Haier et al., (1983) were able to map brain activity using Raven’s scores.
A study by Haier et al., (1992) scanned participants with PET while they played Tetris for the first time and again after fifty days of practice. They found less brain activity while playing Tetris after the practice period. This implies that people with greater ability, whether in a specific domain or in general ability, are able to solve problems using less brain power than others. The mental efficiency theory is further supported by a study by Haier et al., (1995) showing that people with Down’s syndrome and intellectual disability exhibit more brain activity than people of normal intelligence.
The latter study raises another issue that critics of IQ test construct validity must deal with: it is well recognized that IQ tests can tell us when people are intellectually disabled as well as when they are geniuses. The anti-construct-validity camp, including Richardson and Norgate, seems to deny that this applies to the groups in between, which, without proper evidence or some sort of explanation, is very irrational. Later on, I mention that Richardson believes IQ is a measure of social class; but drawing from this example of how IQ tests do detect genius, it is a safe bet that Nathan Leopold’s home neighborhood did not consist entirely of 200-IQ people.
A useful tool in the social sciences is meta-analysis, which allows us to get an average result, typically weighted by some factor like sample size or confidence interval, from a large sample of the available studies on a particular topic. While meta-analyses don’t fix the issue of heterogeneity in research topics, they can build confidence in some particular direction. A meta-analysis and review by Anaturk et al., (2018) compiles studies on the relationship between white matter and “cognitive activity” and finds they are significantly, positively associated. Balm (2014) explains that white matter is particularly important for the brain as it connects various pathways and allows information to travel to different areas. People diagnosed with autism have shorter pathways, indicating fewer connections are being made in their brains.
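The weighting idea described above can be made concrete in a few lines of Python. This is only a minimal sketch of a sample-size-weighted mean correlation, not the full procedure any particular meta-analysis uses; all of the correlations and sample sizes below are hypothetical.

```python
# Minimal sketch of sample-size weighting in a meta-analysis.
# All correlations and Ns are hypothetical, for illustration only.

def weighted_mean_r(correlations, sample_sizes):
    """Sample-size-weighted mean correlation across studies."""
    total_n = sum(sample_sizes)
    return sum(r * n for r, n in zip(correlations, sample_sizes)) / total_n

studies_r = [0.25, 0.40, 0.31]  # observed validities from three studies
studies_n = [50, 400, 120]      # their sample sizes

# The large N = 400 study pulls the pooled estimate toward its value of .40.
print(round(weighted_mean_r(studies_r, studies_n), 3))  # 0.368
```

A small study with an outlying result therefore moves the pooled estimate far less than a large one, which is the sense in which meta-analysis "builds confidence in some particular direction."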
Global brain MRI measures such as white matter microstructure and tissue volume correlate at r = 0.46 with general intelligence (Ritchie et al., 2015). We also know that gray matter is associated with the g portion of IQ tests because of two major analyses. Gray matter volume is significantly associated with IQ scores (Wilke et al., 2003; Cox et al., 2019). Colom et al., (2006a) found that the g-loading of the subtests accounted for all of the relationship between full-scale IQ and gray matter accumulation in the anterior cingulate, frontal, parietal, temporal, and occipital cortices. Colom et al., (2006b) found a nearly perfect correlation between the g-loading of the subtests and the amount of gray matter associated with each subtest.
Because of the revolutions in psychology and neuroscience over the past century, we have some (though certainly not a full) understanding of how the brain works and what its different sections do. This makes models such as the P-FIT model very relevant. The model essentially describes the pathway by which intelligent people’s brains process information and solve problems. Haier (2016) explains it, saying:
“The brain areas in our model represent four stages of information flow and processing while engaged in problem-solving and reasoning. In stage 1, information enters the back portions of the brain through sensory perception channels. In stage 2, the information then flows forward to association areas of the brain that integrate relevant memory, and in stage 3 all this continues forward to the frontal lobes that consider the integrated information, weigh options and decide on any action, so in stage 4 motor or speech areas for action are engaged if required. This is unlikely to be a strictly sequential, one-way flow. Complex problems are likely to require multiple, parallel sequences back and forth among networks as the problem is worked in real time.
The basic idea is that the intelligent brain integrates sensory information in posterior areas, and then the information is further integrated to higher-level processing as it flows to anterior areas. The PFIT also suggests that any one person need not have all these areas engaged to be intelligent. Several combinations may produce the same level of general intelligence, but with different strengths and weaknesses for other cognitive factors. For example, two people might have the same IQ, or g level, but one excels in verbal reasoning, and the other in mathematical reasoning. They may both have some PFIT areas in common, but it is likely they will differ in other areas.”
Some reliable evidence for this model comes from a study using the large UK Biobank sample by Cox et al., (2019). They used confirmatory factor analysis and principal component analysis to extract g from four main tests of ability, then analyzed it against different areas of the brain and total brain volume. From this, they found significant, positive relationships between g and brain volume, gray matter volume, white matter volume, and a general factor of white matter fractional anisotropy. Compared to brain volume alone, the P-FIT model was a better fit in explaining g. All of the prior evidence combined should lead us to believe that a neuro-g is accessible and that IQ tests are tapping into mental ability, giving them construct validity. So, Richardson and Norgate are entirely wrong on this issue.
Predictive Validity of IQ
Richardson and Norgate’s next arguments are against the predictive validity of IQ testing. They first question the causality behind the correlation between IQ and educational achievement, which they believe is an artifact of test construction. Richardson and Norgate say,
“Rather they are merely different versions of the same test. Since the first test designers such as Binet, Terman, and others, test items have been devised, either with an eye on the kinds of knowledge and reasoning taught to, and required from, children in schools, or from an attempt to match an impression of the cognitive processes required in schools. This matching is an intuitively-, rather than a theoretically-guided, process, even with nonverbal items such as those in the Raven’s Matrices.”
Richardson and Norgate first refer to a couple of quotes which support their argument that the correlation between IQ and educational achievement is an artifact of test construction. In particular, they themselves say that “a correlation between IQ and school achievement may emerge because the test items demand the very kinds of (learned) linguistic and cognitive structures that are also the currency of schooling” and cite a quote from Thorndike and Hagen which says, “From the very way in which the tests were assembled [such correlation] could hardly be otherwise”. But this would require the correlation between verbal intelligence and school achievement to be substantially greater than that between performance intelligence and school achievement. According to a study by Hartlage and Steele (1977), which looked at the WISC and the WISC-R, the correlations are all very close, aside from the correlation between Reading achievement and Verbal IQ on the WISC-R. Every other correlation runs contrary to Richardson and Norgate’s hypothesis.
Richardson and Norgate provide little evidence that the correlation must be due to the tests being partially informed by knowledge taught in schools. Aside from some quotes, one of the only pieces of evidence they provide is the observation that the correlation between IQ and educational achievement grows with age, though they fail to explain how this supports their argument. It could just as easily be viewed through a pro-IQ lens: as people age and succeeding in education becomes more difficult, educational success demands a great deal more cognitive ability, tightening the relationship and increasing the correlation.
After incorrectly using the age interaction as an argument, Richardson and Norgate argue that parental drive correlates with IQ. But the effects of parental drive seem to disappear by adulthood, whether viewed through a genetic lens (Bouchard, 2013) or an environmental lens (Dickens and Flynn, 2001), because the effect of shared environment on phenotypic IQ almost entirely disappears by adulthood. Any “effect” of parental drive on IQ by adulthood is easily reducible to gene-environment correlation – higher-IQ parents, being more motivated and caring more about their kids’ outcomes, motivate their kids more in addition to putting them in better schools. The large genetic correlation between educational achievement and intelligence (e.g. Luo et al., 2003) casts further doubt on Richardson and Norgate’s theory. Putting all of this together, Richardson and Norgate make no convincing argument that the correlation between educational achievement and IQ is an artifact of test construction.
Richardson and Norgate’s arguments against the relationship between IQ and occupational level and income are entirely contingent on their argument about IQ and educational achievement being accurate. Since, as I have shown, it is not, they have little ground to stand on, and their arguments about predictive validity thus far are unsubstantiated. Finally, they segue into the main subject of this article: the relationship between IQ and job performance.
Doubts About These Studies
After a review of the main evidence cited in favor of a strong IQ-job performance correlation, they turn to a basic note about meta-analysis before moving into a full rebuttal. Their argument stems from a general criticism of meta-analysis. They state,
“Generally, meta-analyses are rarely straightforward and, at times, have been controversial. Although undoubtedly useful in many subject areas, as Murphy (2003) says, they are often viewed with distaste because they mix good and bad studies, and encourage the drawing of strong conclusions from often-weak data. In the IQ-job performance studies in question, quality checks are often difficult because the original reports were unpublished, sometimes with parts of original data lost.”
While this is true, as Richardson and Norgate noted earlier, the meta-analyses are designed to give the best studies the most weight. This is done in particular through error weighting and n-weighting. Hunter and Schmidt (2004) state,
“That is, the weight for each study is the product of two factors: the sample size N(sub)i and the square of the artifact attenuation factor A(sub)i. . . . This weighting scheme has the desired effect: The more extreme the artifact attenuation in a given study, the less the weight assigned to that study. That is, the more information contained in the study, the greater its weight.”
Hence, while meta-analysis is not at all perfect, it is certainly a useful tool and can be balanced to give the best studies the most weight.
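The scheme described in the quotation is easy to state concretely: each study's weight is w_i = N_i × A_i², where A_i is that study's compound artifact attenuation factor. A minimal Python sketch, with hypothetical numbers:

```python
# Hunter-Schmidt study weighting: w_i = N_i * A_i**2, where A_i is the
# compound artifact attenuation factor for study i. Numbers are hypothetical.

def study_weight(n, attenuation):
    """Weight assigned to a study of sample size n with attenuation factor A."""
    return n * attenuation ** 2

# A heavily attenuated study (A = 0.5) receives a quarter of the weight of
# an equally sized study with no attenuation (A = 1.0):
print(study_weight(100, 0.5))  # 25.0
print(study_weight(100, 1.0))  # 100.0
```

This is the mechanical sense in which "the more extreme the artifact attenuation in a given study, the less the weight assigned to that study."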
The Many Surrogates of IQ Tests
Richardson and Norgate’s next criticism amounts to the claim that the tests included in meta-analyses are of a wide variety and may be measuring different factors. Some of the examples they give are minor, such as “working memory tests, reading tests, scholastic aptitude tests (SATS) and university admission tests, all taken in meta-analyses as surrogate measures of IQ.” This poses no real problem, since these tests (though less so today) do represent cognitive ability in some sense, even if “indirectly”.
They also point to a meta-analysis which uses tests explicitly separated into “g-tests” and “batteries”, arguing that this distinction implies a measurement of different abilities. While this may seem odd, the tests labelled as batteries are equally reliable measures of g. I was able to find studies on some of the batteries listed. The DAT was found to be a reliable measure of cognitive ability and g by Nijenhuis, Evers, and Mer (2000). A simple overview of the subtests in the DAT should allow one to see that they measure abilities similar to those in the “g-tests”. Another battery example, the GATB, has a very well-established record as a g-loaded test. Analyses of the GATB include Jensen (1985), van der Flier & Boomsma-Suerink (1994), and Evers, van Vliet-Mulder, and Ter Laak (1992) (the latter two cited within a paper by Nijenhuis and van der Flier); also see Hartigan and Wigdor (1989) for a review of the validity and psychometric quality of the test. I’ll note that Richardson and Norgate later cite this review, yet treat its entire chapter dedicated to the psychometric validity of the GATB as if it were non-existent. An analysis of the Wilde Intelligence Test (Lippens, 2016) finds it highly correlates with other measures of mental ability, making it, possibly indirectly though likely directly, a measure of intelligence (though I’ll note this study recommends future validity testing). So the distinction between battery and g-test, though off-putting, seems to mean little for the validity and psychometric quality of the tests.
Richardson and Norgate note the Flynn Effect may have some effect on the reliability of the correlation. The Flynn effect does not occur on g though (Nijenhuis and van der Flier, 2013), and g appears to be the driving force behind the IQ-job performance correlation (Lee, Earles, and Teachout, 1994).
Richardson and Norgate’s next section attempts to show that the most commonly used assessment of job performance is supervisory ratings and that, because of the biases associated with supervisory ratings, these ratings are unstable.
First, I’ll note that, yes, most studies are centered around supervisory ratings. However, many aren’t, and when put up against each other, the correlations between IQ and various measures of job performance are consistently high. A review by Schmidt and Hunter (2004) displays the relevant meta-analyses in a table; two of these, McHenry et al., (1990) and Ree et al., (1994), use job sample tests, which Richardson and Norgate consider “more objective criteria”, and their correlations do not differ from those of the other samples.
Richardson and Norgate later elaborate that there is some controversy over how much we should correct for unreliability in supervisory ratings. This is a fair point, but it only strengthens the case from objective work samples, where the correlation seems to be at least as high – precisely the type of job performance test Richardson and Norgate prefer.
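To see why the size of the reliability estimate matters, recall the standard correction for criterion unreliability: the observed validity is divided by the square root of the criterion reliability. The sketch below, with illustrative numbers only, shows that assuming a lower reliability for supervisory ratings produces a larger corrected correlation.

```python
import math

# Standard disattenuation for criterion unreliability:
# r_corrected = r_observed / sqrt(r_yy), where r_yy is the reliability
# of the criterion (e.g. supervisory ratings). Numbers are illustrative.

def correct_for_unreliability(r_obs, criterion_reliability):
    return r_obs / math.sqrt(criterion_reliability)

# The same observed validity of .30 under two assumed reliabilities:
print(round(correct_for_unreliability(0.30, 0.80), 3))  # 0.335
print(round(correct_for_unreliability(0.30, 0.52), 3))  # 0.416
```

The disagreement over whether ratings reliability is nearer .80 or .52 is therefore a disagreement over how far upward the observed validities should be adjusted.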
They list a number of biases which affect the supervisor’s decisions and which may potentially exaggerate the correlation (though as we can see above, apparently not). First, they note that “halo” effects exist which may bias the correlation, but an analysis by Viswesvaran, Schmidt, and Ones (2002) developed a method to control for this and found that general mental ability was still a strong predictor of job performance. They note that supervisory ratings are biased by height, facial attractiveness, and race/ethnicity, all of which correlate with IQ in such a way that may confound the correlation. However, there are some caveats.
The Height x Rating and Attractiveness x Rating correlations do exist, though a couple of points are worth making. First, taller people do seem to be objectively better at their jobs in general and good at advancing, likely due to a greater sense of self-esteem (see Rosenberg, 2009). The effect of height on wages (which correlates with supervisory ratings) is also non-linear, with the effect concentrated among the tallest people (Kim and Han, 2017). One would have to linearize the data and control for its effect on supervisory ratings, while also taking objective performance measures into account, to establish that actual bias is occurring. Altogether, this would not be impossible, but it would be difficult to control for, and likely for little reward. Facial attractiveness seems to have a small association (Cohen’s d of less than 0.5) with work-related outcomes, meaning the effect of reducing this bias would likely be small (Hosoda et al., 2006).
When it comes to racial bias in supervisory ratings, the main analysis people point to is that of Stauffer and Buckley (2005). This analysis does show statistical racial bias in supervisory ratings; however, the authors warn that this cannot be reduced to prejudice. At the very least, they note that both black and white raters agree that white workers typically perform better than black workers – the disagreement is over degree. Perhaps a more fundamental question is: which group is correct? A meta-analysis by McKay and McDaniel (2006) examines studies on both objective and subjective ratings of job performance in blacks and whites; the objective measures aligned more closely with the white supervisors’ ratings. Interestingly, a study by Roth, Huffcutt and Bobko (2003) finds that objective measures of job performance actually show a larger racial difference in job performance than subjective measures do. This would mean that racial bias in supervisory ratings actually leads to underestimating the job performance-IQ correlation.
So, from all of the above, it appears that supervisory ratings are not biased in a way that would largely undermine the IQ-job performance correlation. This is shown by the fact that objective measures of job performance, such as work sample tests, correlate just as well as subjective measures do, and by the absence of bias – or at least of the type of bias Richardson and Norgate suppose is present in supervisory ratings. As I said above, the effect of race actually seems to deflate the correlation.
Corrections for Sampling Error, Measurement Error, and Range Restriction
Richardson and Norgate’s more concrete criticisms come in their attack on how much Hunter and Schmidt (and others) correct for various artifacts in their meta-analyses. While some (Kaufman and Kaufman, 2015) have agreed that the job performance corrections are likely too large, Richardson and Norgate definitely overestimate the degree to which this is true.
Richardson and Norgate’s main arguments are as follows:
- Correcting for sampling error is much more complicated when much of the data is missing, as we have to assume that the samples are random, and there appears to be non-randomness in the type of samples acquired in these studies. Additionally, because of the stage at which sampling error is corrected and the unavailability of full data, the estimate of sampling error may be overestimated. They point primarily to an analysis by Hartigan and Wigdor which found the sampling error in the studies was closer to 50% than 70%.
- The estimates of test reliability given by Hunter and Schmidt are likely too low. Correcting for measurement error may also cause some issues, such as increasing the standard error and widening confidence intervals. There is also uncertainty created by disagreement among raters about which aspects of the job should be valued. And some measurement error is ignored because of the supposed lack of construct validity of IQ tests.
- Finally, Hunter and Schmidt’s analyses make assumptions about the SDs of job applicants’ test results in order to correct for range restriction; Richardson and Norgate rely on Hartigan and Wigdor’s analysis to argue these assumptions are flawed.
- Some newer analyses find lower rates of error in the job performance-IQ correlation as well as smaller correlations.
As I said, these critiques hold some weight, but not as much as Richardson and Norgate think.
There are some studies which, at least partially, disagree with point 1. Brannick and Hall (2001), though they argue that Hunter and Schmidt overestimate the proportion of variance in meta-analyses due to sampling error, find that making a statistical correction for this only reduces bias in small samples of studies. Even if we concede that the sampling error is somewhat smaller than Hunter and Schmidt’s estimate, it is only one small piece of the equation.
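For context, the Hunter-Schmidt framework asks what share of the observed between-study variance in correlations would be expected from sampling error alone, using the approximation Var_e = (1 − r̄²)² / (N̄ − 1). A minimal sketch, with hypothetical figures:

```python
# Expected sampling-error variance of observed correlations in the
# Hunter-Schmidt framework. All inputs are hypothetical.

def sampling_error_variance(mean_r, mean_n):
    """Approximate variance in observed r's due to sampling error alone."""
    return (1 - mean_r ** 2) ** 2 / (mean_n - 1)

var_e = sampling_error_variance(0.25, 68)  # mean r = .25, mean N = 68
print(round(var_e, 4))  # 0.0131

# If the observed variance of r's across studies were .018, sampling error
# would account for roughly 0.0131 / 0.018, about 73% of it.
```

The dispute between Hartigan and Wigdor's ~50% figure and Hunter and Schmidt's ~70% figure is a dispute over exactly this ratio of expected-to-observed variance.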
It is true that confidence intervals may be widened by correcting for measurement error, but the resulting confidence intervals are not out of bounds for reliable research (see Hunter and Hunter, 1984). In response to the argument in point 2 that Hunter and Schmidt’s reliability estimates are too low (for which Richardson and Norgate rely on the Hartigan and Wigdor analysis), Salgado, Viswesvaran and Ones (2014) have shown that Hartigan and Wigdor miscalculated the interrater reliability. They say:
“Recent results by Rothstein (1990), Salgado and Moscoso (1996), and Viswesvaran, Ones and Schmidt (1996) have shown that Hunter and Hunter’s estimate of job performance ratings reliability was very accurate. These studies showed that the interrater reliability for a single rater is lower than .60. If Hunter and Hunter’s figures were applied to the mean validity found by the panel, the average operational validity would be .38, a figure closer to Hunter and Hunter’s result for GMA predicting job performance ratings.
A fifth meta-analysis was carried out by Schmitt, Gooding, Noe and Kirsch (1984) who, using studies published between 1964 and 1982, found an average validity of .22 (uncorrected) for predicting job performance ratings. Correcting this last value using Hunter and Hunter’s figures for criterion unreliability and range restriction, the average operational validity resulting is essentially the same in both studies (see Hunter & Hirsh, 1987).
Meta-analysis of the criterion-related validity of cognitive ability has also been explored for specific jobs. For example, Schmidt, Hunter and Caplan (1981) meta-analyzed the validities for craft jobs in the petroleum industry. Hirsh, Northrop and Schmidt (1986) summarized the validity findings for police officers. Hunter (1986) in his review of studies conducted in the United States military estimated GMA validity as .63. The validity for predicting objectively measured performance was .75.”
In regard to that last paragraph, it should be brought up that Richardson and Norgate ignore the meta-analyses on work sample tests and GMA, which I brought up earlier.
The argument presented in point 3, regarding range restriction, is the least compelling. Sackett and Ostgaard (1994) reply to Hartigan and Wigdor’s analysis, which excluded a correction for range restriction, by empirically estimating the standard deviations for applicants across a wide range of jobs. Based on this analysis, they argued that Hartigan and Wigdor’s exclusion of the range restriction correction produces a much larger downward bias than any upward bias created by Hunter and Schmidt’s assumptions. Finally, Hartigan and Wigdor view their analysis as a positive replication of Hunter and Schmidt’s work, making it somewhat misplaced as a major threat to the previous analyses.
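The range restriction correction at issue can also be stated concretely. Thorndike's Case II formula, the standard correction for direct range restriction, scales the observed correlation by the ratio of the applicant-pool SD to the incumbent SD, which is exactly the SD assumption Richardson and Norgate dispute. A minimal sketch with illustrative numbers:

```python
import math

# Thorndike Case II correction for direct range restriction.
# u = restricted (incumbent) SD / unrestricted (applicant) SD.
# All numbers are illustrative, not from any cited meta-analysis.

def correct_range_restriction(r_restricted, u):
    U = 1.0 / u  # unrestricted-to-restricted SD ratio
    return (r_restricted * U) / math.sqrt(1 + r_restricted ** 2 * (U ** 2 - 1))

# Observed r of .30 among hires whose test-score SD is 60% of the applicant SD:
print(round(correct_range_restriction(0.30, 0.6), 3))  # 0.464
```

The smaller the assumed incumbent-to-applicant SD ratio, the larger the corrected validity, which is why the empirical applicant-pool SD estimates of Sackett and Ostgaard matter so much to this debate.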
Finally, Richardson and Norgate note some analyses, primarily that done by Hartigan and Wigdor, that show the amount of error in studies on job performance and IQ has decreased, as well as the final correlations. Viswesvaran, Ones and Schmidt (1996) criticize Hartigan and Wigdor’s refusal to correct for measurement error and range restriction. This can also be applied to Richardson and Norgate’s later argument about job complexity and IQ:
“The results reported here can be used to construct reliability artifact distributions to be used in meta-analyses (Hunter & Schmidt, 1990) when correcting for unreliability in the criterion ratings. For example, the report by a National Academy of Sciences (NAS) panel (Hartigan & Wigdor, 1989) evaluating the utility gains from validity generalization (Hunter, 1983) maintained that the mean interrater reliability estimate of .60 used by Hunter (1983) was too small and that the interrater reliability of supervisory ratings of overall job performance is better estimated as .80. The results reported here indicate that the average interrater reliability of supervisory ratings of job performance (cumulated across all studies available in the literature) is .52. Furthermore, this value is similar to that obtained by Rothstein (1990), although we should point out that a recent large-scale primary study (N = 2,249) obtained a lower value of .45 (Scullen et al., 1995). On the basis of our findings, we estimate that the probability of interrater reliability of supervisory ratings of overall job performance being as high as .80 (as claimed by the NAS panel) is only .0026. These findings indicate that the reliability estimate used by Hunter (1983) is, if anything, probably an overestimate of the reliability of supervisory ratings of overall job performance. Thus, it appears that Schmidt, Ones, and Hunter (1992) were correct in concluding that the NAS panel underestimated the validity of the General Aptitude Test Battery (GATB). The estimated validity of other operational tests may be similarly rescrutinized.”
Taking this into consideration: since the Hartigan and Wigdor analysis, other analyses have appeared that draw on far larger pools of studies. Kuncel, Ones, and Sackett (2010) summarize some findings, primarily from a meta-analysis of over 20,000 studies and a sample of over 5,000,000 people. This, of course, is entirely overlooked by Richardson and Norgate:
In fact, the validity of cognitive ability tests in predicting overall job performance is about .50. This is the major finding summarizing results across over 20,000 primary studies including data from over five million individuals (Ones et al., 2004a). . . .
Ones, Viswesvaran, and Dilchert (2005a) provide a thorough accounting of the meta-analytic validities for cognitive ability tests in work settings. Here we summarize the general findings across the meta-analyses reviewed by Ones and her colleagues. First, cognitive ability tests predict how well employees learn in job training. Across jobs and industries, the relationship between ability test scores and training performance is in the .60s. However, the higher the complexity of the job and thus the complexity of the job knowledge to be acquired, the higher the validities are. Second, cognitive ability tests predict actual job performance very well. As we have already noted, the validity for tests of general mental ability are in the .50s across jobs (for overviews of the existing evidence see Ones et al., 2004a,b, in press). Confirming the importance of cognitive ability as a determinant of complex work performance, validities are higher in high complexity jobs – in the upper .50s to .60s range. However, we should note that even for the lowest complexity jobs, validities tend to be substantial – in the .30–.40 range. Individuals who score higher on cognitive ability tests perform better on their jobs. There is evidence from large samples that cognitive ability tests are even useful in predicting rule compliance and avoidance of detected counterproductive work behaviors (e.g., theft, security violations, aggressive behaviors; Dilchert, Ones, Davis, & Rostow, 2007) and rule compliance (Mount, Oh, & Burns, 2008).
In addition to overlooking work sample tests, Richardson and Norgate ignore other forms of job outcomes/performance. As Salgado, Viswesvaran and Ones (2014) state:
“GMA also predicts criteria other than just job performance ratings, training success, and accidents. For example, Schmitt et al. (1984) found that GMA predicted turnover (r = .14; n = 12,449), achievement/grades (r = .44, n = 888), status change (promotions) (r = .28, n = 21,190), and work sample performance (r = .43, n = 1,793). However, all these estimates were not corrected for criterion unreliability and range restriction. Brandt (1987) and Gottfredson (1997) have summarized a large number of variables that are correlated with GMA. From a work and organizational psychological point of view, the most interesting of these are the positive correlations between GMA and occupational status, occupational success, practical knowledge, and income, and GMA’s negative correlations with alcoholism, delinquency, and truancy. Taking together all these findings, it is possible to conclude that GMA tests are one of the most valid predictors in IWO psychology. Schmidt and Hunter (1998) have suggested the same conclusion in their review of 85 years of research in personnel selection.”
The Job Complexity Issue
Richardson and Norgate argue that the correlations between IQ-job performance and job complexity are exaggerated by Hunter and Schmidt’s previous analyses. They argue this primarily by means of the Hartigan and Wigdor analysis, which, as I have stated earlier, is largely an underestimate. Richardson and Norgate also suggest that psychological variables, such as self-esteem, may confound the correlation, and note that people in jobs of lower complexity communicate less with their managers. While this is true, there are, once again, caveats. If the latter criticism were a major issue, the relationship between job complexity and the IQ-job performance correlation would not persist in work sample tests, as seen in an analysis by Salgado and Moscoso (2019).
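The other correction Hartigan and Wigdor declined to make, for range restriction, can likewise be illustrated. Applicant pools are screened on the predictor, which shrinks its variance in the hired sample and depresses the observed correlation; Thorndike's Case II formula recovers the unrestricted validity. The numbers below are hypothetical, chosen only to show the size of the effect:

```python
import math

def correct_range_restriction(r: float, u: float) -> float:
    """Thorndike's Case II correction for direct range restriction.
    r: validity observed in the restricted (selected) sample.
    u: ratio of the unrestricted to the restricted predictor SD (u > 1)."""
    return (r * u) / math.sqrt(1 - r**2 + (r**2) * (u**2))

# Hypothetical: a validity of .25 observed in a hired sample whose
# predictor SD is only 70% of the applicant pool's (u = 1/0.7)
print(round(correct_range_restriction(0.25, 1 / 0.7), 2))  # 0.35
```

Even moderate restriction pushes an observed .25 to roughly .35, so analyses that skip the correction, as Hartigan and Wigdor's did, will predictably report lower validities.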
Another reason to be wary of Richardson and Norgate’s criticism is that while people higher in intelligence do tend to have higher self-esteem, this does not translate into greater confidence in job ability (Lynch and Clark, 1985).
Additionally, Richardson and Norgate ignore the fact that the correlation between IQ and educational achievement increases at higher levels of education (Arneson, Sackett and Beaty, 2011). Given this, it is very likely that IQ becomes more important for other complex tasks, independent of whatever biases are present in the workplace.
Hence, Richardson and Norgate make both issues out to be larger than they actually are.
Of Correlations and (Non-Cognitive) Causes
Finally, Richardson and Norgate attempt to deconstruct the argument that mental ability drives the correlation. Most of these arguments rest on ignoring contrary evidence and on their earlier, incorrect dismissal of construct validity. First, they cite a study showing that job knowledge tests mediated the relationship between IQ and job performance. This is fallacious for the same reason that controlling for socioeconomic status in the relationship between race and IQ is. Since IQ predicts job knowledge (as well as job training results), the attenuation of the relationship may be entirely artifactual.
In response to this, Richardson and Norgate simply turn back to calling g an “uncharacterized concept” despite its consistent validity (Dalliard, 2013). If cognitive ability tests were merely measures of social status, with the predictive power of IQ built into the tests through that status, as Richardson (2002) asserts, then we would not see, firstly, major differences in IQ between siblings, and secondly, equal predictive power within families as between families (Herrnstein and Murray, 1994; Murray, 2002; Frisell et al., 2012; Aghion et al., 2018; Hegelund et al., 2019). Additionally, as Kuncel et al. (2014) found, the relationship between job performance and cognitive ability is not mediated by SES.
Ironically, Richardson and Norgate later refer to emotional intelligence as a strong predictor of job performance, despite the fact that it is much more ill-defined than g (Murphy, 2006). Emotional stability may play some role, but this is likely because of its psychological link to intelligence (see Dean, 2018).
Richardson and Norgate also point to a study which shows that high performers on Wall Street who switched firms suffered a decline in performance. However, pushing this as a major argument seems detached from reality: the mere existence of this decline does not mean that intelligence stops being an important factor. Nor does a sample of Wall Street’s high performers tell us much about the American population.
Finally, they point out some effects of anxiety and motivation on test taking. The relationship between motivation and test scores actually runs in the opposite direction from what Richardson and Norgate believe (Reeve and Lam, 2007a). A study by Gignac et al. (2019) finds that motivation had a modest correlation with IQ, but the effect was non-linear and entirely confined to the low-to-moderate range of intelligence. The effect of anxiety on IQ scores in general seems to be up for debate, as Jensen (1980) not only raises issues with the measures of anxiety, but also with the relationships reported in the literature:
“There is a considerable literature on the role of anxiety in test performance. The key references to this literature are provided in reviews by Anastasi (1976, pp. 37-38), Matarazzo (1972, pp. 439-449), I. G. Sarason (1978), S. B. Sarason et al. (1960), and Sattler (1974, p. 324). In brief, many studies have reported generally low but significant negative correlations between various measures of the subject’s anxiety level, such as the Taylor Manifest Anxiety Scale and the Sarason Test Anxiety Scale, and performance on various mental ability tests. Many nonsignificant correlations are also reported, although they are in the minority, and are usually rationalized by the investigators in various ways, such as atypical samples, restriction of range on one or both variables, and the like (e.g., Spielberger, 1958). I suspect that this literature contains a considerably larger proportion of “findings” that are actually just Type I errors (i.e., rejection of the null hypothesis when it is in fact true) than of Type II errors (i.e., failure to reject the null hypothesis when it is in fact false). Statistically significant correlations are more often regarded as a “finding” than are nonsignificant results, and Type I errors are therefore more apt to be submitted for publication. Aside from that, sheer correlations are necessarily ambiguous with respect to the direction of causality. Persons who, because of low ability, have had the unpleasant experience of performing poorly on tests in the past may for that reason find future test situations anxiety provoking—hence a negative correlation between measures of test anxiety and ability test scores”
Reeve and Lam (2007b) also find that practice effects were negatively associated with g-saturation. So, all in all, the arguments Richardson and Norgate raise against the validity of general mental ability are deeply flawed.
Their Summary and Conclusion
After providing a discussion of their argument and the implications of the IQ-job performance correlation, they provide a short summary of their points. Here are their key points and my key responses:
- Much in developmental theory, and psychology in general, depends upon the validity of IQ tests.
This is correct.
- In the absence of agreed construct validity this has weighed heavily on indirect validity using correlations with criterion outcomes among which job performance has a special status.
There is significant evidence, especially from neuroscience and reaction-time studies, that IQ tests do tap mental ability.
- Hundreds of studies prior to the 1970s reported low and/or inconsistent correlations between IQ and job performance.
Yes, and when corrected through standard meta-analytic procedures, which are the best we have, the correlations are much higher.
- These correlations have been approximately doubled using corrections for supposed errors in primary results and combining them in meta-analyses. Such corrections have many strengths, theoretically, but are compromised in these cases by the often uncertain quality of the primary studies.
The level of compromise is not as large as Richardson and Norgate make it out to be. Their claims rest largely on a selective analysis by Hartigan and Wigdor, which made flawed assumptions about measurement error and range restriction.
- The corrections to sampling errors, measurement errors, and to range restriction have required making a number of assumptions that may not be valid and have created a number of persistently contentious issues.
- The claim that the IQ-job performance correlation increases with job complexity is not borne out in more recent studies.
These studies were not corrected for measurement error and range restriction, and the relationship persists in newer analyses as well.
- A range of other—including noncognitive— factors could explain a correlation between IQ and job performance, and even constitute part or all of the enigmatic ‘‘general factor.’’
The examples they offer either fail to explain the general factor or are fallacious at their core.
- There remains great uncertainty about the interpretation of IQ-job performance correlations and great caution needs to be exercised in using them as a basis for the validity of IQ tests and associated concepts.
They are not the only basis used for predictive validity; many meta-analyses have been undertaken showing the consistent role of IQ in just about every major success outcome (Strenze, 2015). As shown above, the degree to which Richardson and Norgate believe the relationship is overestimated is far too large. The Hunter and Schmidt correlations are likely overestimated to some extent, but they are far more reliable than those proposed by Richardson and Norgate.
I made a number of edits after the original version of this post. This was primarily because Emil Kirkegaard sent me a significant amount of sources I could read through and because I had read more the following night. Originally, this post was far less critical of Richardson and Norgate. I conceded some things I shouldn’t have and was far more forgiving of their mistakes. But, I knew that Richardson was disingenuous and I shouldn’t have given him the benefit of concession, primarily on something like construct validity.
This note is here in case I add anything else I find along the way as I want this post to be as solid as it can be, given the burden of proof is on me to respond to Richardson and Norgate.