34. Determination of Constants.—When the joint distribution of two variables X and Y is supposed to follow a certain law, we proceed in exactly the same way as for the distribution of a single variable. We find values for the constants from the data; and we examine the discrepancies between the data and a corresponding distribution deduced from chosen values of the constants, in order to see whether these discrepancies can be reasonably regarded as due to errors of random sampling.
In the case of normal correlation, where the data are distributed in a sufficient number of categories formed by taking values of X at equal intervals, and values of Y also at equal intervals, the values of the means and standard deviations as found from the data are taken to be those of the distribution, and the mean product of the deviations from the means is taken to be the product of the standard deviations and the coefficient of correlation. Values of the constants having been determined, we must pay attention to the "probable errors" of these values. There are, however, numerous kinds of cases in which this method cannot be adopted. We can only consider them briefly.
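The determination just described may be sketched in modern notation. The following Python fragment uses illustrative figures, not data from the text, and the probable-error formula 0.6745(1 − r²)/√n is the classical one, here assumed rather than quoted from the passage above:

```python
import math

# Illustrative paired measurements of X and Y (hypothetical data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 3.2, 4.8, 5.0]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Standard deviations of the two distributions, taken from the data.
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)

# Mean product of the deviations from the means; this is equated to
# r * sd_x * sd_y, which determines the coefficient of correlation r.
mean_product = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
r = mean_product / (sd_x * sd_y)

# Classical probable error of r (0.6745 standard errors); an assumed
# formula, supplied here for illustration only.
probable_error_r = 0.6745 * (1 - r ** 2) / math.sqrt(n)
```

A value of r many times its probable error is then the ground for regarding the correlation as more than an effect of random sampling.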
35. Other Kinds of Cases.—The assumptions we have hitherto made are that (i.) the measures X and Y are normally distributed and normally correlated, so that the regression is linear; (ii.) the number of observations is very large; (iii.) they are distributed into a fairly large number of categories, by tabulation at intervals in X and Y.
Actual cases in which the attributes under consideration are clearly correlated, in the general sense of the term, and in which we should like to have some measure of the correlation, may present difficulties, especially in one or more of the following ways.
(1) The distributions may not be normal; or, even if they are normal, the correlation may not be normal; (2) In addition, the regression may not be linear; (3) The number of cases may be small; (4) The number of categories may be very small; (5) The attributes considered may not be continuously varying quantities.
36. Special Methods.—The following are some special methods for treatment of cases of the kind mentioned in sec. 35.
(i.) If we are dealing with quantities which have continuous variation, but the correlation is not normal, we can still define r as the ratio which the mean product of the deviations from the respective means bears to the product of the standard deviations. This applies whether the regression is linear or not.
(ii.) The definition may still hold, even if the number of categories is so small that we cannot determine the means, etc. Table I., for instance (sec. 6), gives sufficient data for us to find r on the assumption that the two sets of heights are normally distributed and normally correlated.
(iii.) A common class of cases is of the kind shown in Table VI. (quoted from Yule, p. 61). Here there is clearly some correlation; the number of men with fair hair and blue eyes, for instance, would, if hair-colour and eye-colour were independent, be about 2829 × 2811 ÷ 6800 = 1169; actually it is 1768, and the discrepancy is far too great to be due to error of random sampling. But there is nothing to suggest continuous variation of the measure of an attribute. For cases of this kind—usually called contingency cases—we must fix some definition of the ratio which is to be the measure of correlation. The usual ratio is K. Pearson's mean square contingency coefficient, which may be defined as follows. Divide the square of the number in each compartment of the main table by the product of the corresponding subtotals (e.g., divide 1768 × 1768 by 2829 × 2811), and add the results. Let their sum be S. Then the coefficient is defined as √{(S − 1)/S}.
(iv.) A particular case of contingency is association, which is the relation exhibited by a tetrachoric classification, i.e., one in which there is only a distribution of A and not-A under B and not-B. The contingency coefficient defined in (iii.) can be used for these cases; but various other methods have been suggested.
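The computation of S and of the coefficient may be sketched as follows. The Python fragment below uses a small hypothetical fourfold table, not Yule's Table VI., whose full counts are not reproduced here:

```python
import math

# Hypothetical contingency table (rows: classes of one attribute,
# columns: classes of the other); counts are illustrative only.
table = [
    [30, 10],
    [10, 50],
]

# Subtotals for each row and each column of the main table.
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# S: for each compartment, divide the square of its count by the product
# of the corresponding subtotals, and add the results.
S = sum(
    table[i][j] ** 2 / (row_totals[i] * col_totals[j])
    for i in range(len(table))
    for j in range(len(table[0]))
)

# Pearson's mean square contingency coefficient.
C = math.sqrt((S - 1) / S)
```

If the two attributes were independent, every compartment would equal its expected value, S would reduce to 1, and the coefficient would vanish.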
(v.) The last class of cases to be mentioned is that in which we are concerned with two attributes A and B which can be approximately represented by magnitudes X and Y, but the number of individuals is so small, and the laws of frequency of X and Y are so doubtful, that we cannot apply ordinary statistical methods. If, however, the individuals can be arranged in order according to their values of X, and also according to their values of Y, C. Spearman's "rank" method can be adopted. Suppose, as an example, that there are n boys, and that we want to calculate the correlation between their abilities in subjects A and B. Arrange them in order according to their ability in A, and give them the ranks 1, 2, 3 . . . n. Do the same for B. Let d denote, for each boy, the difference between the ranks in the two subjects. Then the rank coefficient of correlation is 1 − 6Σd²/{n(n² − 1)}, where Σd² denotes the sum of the squares of the n d's.
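Spearman's procedure may be sketched in Python as follows, with hypothetical marks for five boys in the two subjects; ties in rank are ignored in this simple sketch:

```python
# Hypothetical marks of n boys in subjects A and B (illustrative only).
marks_a = [88, 72, 95, 60, 79]
marks_b = [84, 70, 90, 65, 68]

def ranks(scores):
    # Rank 1 goes to the highest score, rank n to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

ra, rb = ranks(marks_a), ranks(marks_b)
n = len(marks_a)

# d is, for each boy, the difference between his ranks in the two subjects;
# the rank coefficient is 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
sum_d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
```

When the two orders agree exactly every d is zero and the coefficient is 1; when they are exactly reversed the coefficient is −1.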
37. Multiple Correlation.—The methods which we have been using for dealing with correlation between two attributes can be extended to cases in which there are three or more correlated attributes. The normal formula expressing frequency of joint occurrence of deviations x, y, z, . . . from the respective means involves an expression e^(−U), where U is a quadratic function of x, y, z, . . . This leads to the consideration of partial correlation, i.e., correlation between two of the variables when the remaining variables are taken to have fixed values.
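For three variables, partial correlation can be sketched numerically. The formula used below, r_{xy·z} = (r_{xy} − r_{xz} r_{yz}) / √{(1 − r_{xz}²)(1 − r_{yz}²)}, is the standard one for the correlation of x and y with z held fixed; it is assumed here rather than stated in the passage above, and the data are illustrative:

```python
import math

def pearson_r(xs, ys):
    # Coefficient of correlation: mean product of deviations divided by
    # the product of the standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    mean_product = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return mean_product / (sx * sy)

# Hypothetical observations of three correlated variables (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.1, 3.8, 5.2]
z = [0.9, 2.2, 2.8, 4.1, 4.9]

r_xy = pearson_r(x, y)
r_xz = pearson_r(x, z)
r_yz = pearson_r(y, z)

# Partial correlation of x and y when z is taken to have a fixed value.
r_xy_given_z = (r_xy - r_xz * r_yz) / math.sqrt(
    (1 - r_xz ** 2) * (1 - r_yz ** 2)
)
```

If the whole of the apparent correlation between x and y were due to their common correlation with z, the partial coefficient would be near zero.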