CORRELATION
32. Regression.—The importance of correlated variation in individuals was emphasized by Darwin, and the correlation of deviations from the average in parents and in offspring must have long been a matter of common observation, but it was Francis Galton who, in papers leading up to his Natural Inheritance (1889), first stated a law governing the frequencies of such deviations. The special aspect to which he called attention was regression towards the mean. The sons of tall men are, on the average, tall, but not so tall: the sons of short men are short, but not so short. But this does not mean that the parents of tall men are taller, or the parents of short men shorter. The regression works both ways.
Suppose, for simplicity, that heights, in both generations, are distributed about the same mean value a with the same standard deviation c, the distribution in each case being according to the normal law. Then, if we group together all the fathers whose heights are a+m, the average height of their sons (one son to each father) will not be a+m but a+rm, where r is (in this case) somewhere about 0.5. And, if we group together all sons whose heights are a+n, the average height of their fathers will not be a+n but a+rn, where r has the same value as in the preceding sentence.
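The two-way character of the regression can be checked numerically. The following is a minimal sketch, not a reproduction of Galton's data: the values of a, c, r and m, the sample size, the grouping width and the function names are all illustrative assumptions, and the pairs are drawn from a simulated normal correlation.

```python
import random
from statistics import fmean


def simulate_pairs(n, a, c, r, seed=0):
    """Draw n (father, son) heights: both normal with mean a and
    standard deviation c, with correlation r between the two."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        father = a + c * u
        # Mixing u with an independent v keeps the son's standard
        # deviation equal to c and gives correlation r with the father.
        son = a + c * (r * u + (1.0 - r * r) ** 0.5 * v)
        pairs.append((father, son))
    return pairs


def mean_of_sons(pairs, height, half_width):
    """Average son's height among fathers within half_width of height."""
    return fmean(s for f, s in pairs if abs(f - height) <= half_width)


def mean_of_fathers(pairs, height, half_width):
    """Average father's height among sons within half_width of height."""
    return fmean(f for f, s in pairs if abs(s - height) <= half_width)


# Illustrative values only (inches); none of these come from the article.
a, c, r, m = 68.0, 2.5, 0.5, 3.0
pairs = simulate_pairs(100_000, a, c, r)
print("law predicts a + r*m =", a + r * m)
print("sons of fathers near a + m:", round(mean_of_sons(pairs, a + m, 0.25), 2))
print("fathers of sons near a + m:", round(mean_of_fathers(pairs, a + m, 0.25), 2))
```

Both group averages should fall close to a+rm rather than a+m, whichever generation is used to form the group.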
In the general case, suppose there are a large number of individuals having measurable attributes A and B, their measures being X and Y, and that the values of X are distributed about a mean value a with standard deviation c, and those of Y about a mean value b with standard deviation d; subject to an assumption mentioned below. (In the case considered above, the individual is the pair, father and son, and the attributes are the height of the father and the height of the son.) Then Galton's law of regression is that, if we group together all individuals for which X has the value a+xc, the mean value of Y for these individuals is b+rxd; and also, if we group together all individuals for which Y has the value b+yd, the average value of X for these individuals is a+ryc: where r is the same in the two cases and is between -1 and +1. The ratio rxd/xc = rd/c (which is not necessarily a numerical ratio, since A and B may be different kinds of quantities) is the coefficient of regression of B on A, and the ratio ryc/yd = rc/d is the coefficient of regression of A on B.
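As an arithmetical illustration of the law in the general case, the sketch below evaluates the two group means and the two coefficients of regression. The function names and the figures (a height in inches for X, a weight in pounds for Y, and r = 0.6) are hypothetical and are chosen only to show that the coefficients carry units when A and B are different kinds of quantities.

```python
def regression_of_b_on_a(r, c, d):
    """Coefficient of regression of B on A, rd/c (units of Y per unit of X)."""
    return r * d / c


def regression_of_a_on_b(r, c, d):
    """Coefficient of regression of A on B, rc/d (units of X per unit of Y)."""
    return r * c / d


def mean_y_given_x(X, a, b, c, d, r):
    """Mean of Y among individuals for which X has the given value."""
    x = (X - a) / c  # deviation of X in units of its standard deviation
    return b + r * x * d


def mean_x_given_y(Y, a, b, c, d, r):
    """Mean of X among individuals for which Y has the given value."""
    y = (Y - b) / d  # deviation of Y in units of its standard deviation
    return a + r * y * c


# Hypothetical figures, not taken from the article.
a, c = 68.0, 2.5    # mean and standard deviation of X (inches)
b, d = 150.0, 20.0  # mean and standard deviation of Y (pounds)
r = 0.6

print("B on A:", regression_of_b_on_a(r, c, d), "pounds per inch")
print("A on B:", regression_of_a_on_b(r, c, d), "inches per pound")
print("mean Y when X = a + c:", mean_y_given_x(a + c, a, b, c, d, r))
print("mean X when Y = b + d:", mean_x_given_y(b + d, a, b, c, d, r))
```

The product of the two coefficients is r squared, so one is the reciprocal of the other only when r is -1 or +1; this is the sense in which the regression works both ways.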
The whole of the above is based on the assumption that the values of X and of Y are normally distributed and normally correlated. The algebraical expression of normal correlation is considered below: its general character can be seen from Table V., which is a correlation-table showing the relation between brother and sister as regards span.
33. Normal Correlation.—The form of the equation to a normal frequency-distribution was obtained (sec. 21 [iii.]) by considering the frequencies of deviations from a representative distribution in the case of alternatives A and a. The normal law of distribution of these latter frequencies having been obtained, we can easily deduce the corresponding formula for distribution in any number of categories, and thence obtain the general formula for a normal frequency-distribution of any number of variables. For the present we can limit ourselves to the case of two variables, which means three categories.
(i.) Suppose there are three categories P, S, T, the relative frequencies of which are p, s, t, so that a representative distribution of n individuals would give np, ns, nt in the three categories. Denote these by n0, n1, n2, so that [...] Then [...] F and H being positive, and [...]
(iii.) By taking mean squares c² and d² of X and Y, and mean product of X and Y, it will be found that U can be expressed in the form [...] where r = (mean product of X and Y) ÷ cd. Hence it follows (by fixing X in the one case and Y in the other) that the coefficient of regression of B on A is rd/c, and that of A on B is rc/d. Thus r is the same r that we considered in sec. 32. It is called the coefficient of correlation; its value is between -1 and +1. The correlation is said to be positive or negative according as r is positive or negative.
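The definition of r as a mean product divided by cd can be applied directly to paired observations. The sketch below assumes that X and Y are measured as deviations from their respective means, so the means are subtracted first; the function name and the five pairs of values are invented for illustration.

```python
from statistics import fmean


def correlation_summary(xs, ys):
    """Return c, d, r and both coefficients of regression for paired data."""
    a, b = fmean(xs), fmean(ys)
    dx = [x - a for x in xs]  # deviations of X from its mean
    dy = [y - b for y in ys]  # deviations of Y from its mean
    c = fmean(u * u for u in dx) ** 0.5  # root of the mean square of X
    d = fmean(v * v for v in dy) ** 0.5  # root of the mean square of Y
    mean_product = fmean(u * v for u, v in zip(dx, dy))
    r = mean_product / (c * d)
    return {
        "c": c,
        "d": d,
        "r": r,
        "B_on_A": r * d / c,  # equal to mean_product / c**2
        "A_on_B": r * c / d,  # equal to mean_product / d**2
    }


# Invented paired values, used only to exercise the formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [20.0, 10.0, 40.0, 30.0, 50.0]
summary = correlation_summary(xs, ys)
for name, value in summary.items():
    print(name, round(value, 4))
```

Reversing the sign of every Y reverses the sign of r and of both coefficients, which corresponds to the distinction between positive and negative correlation.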