BIOMETRY, a word to which three somewhat different meanings may be given. (a) Just as the word geometry, originally denoting land-measurement, has come to refer strictly to the mathematical study of the properties of space, so biometry may be taken to mean the mathematical study of the properties of populations, a subject otherwise designated as mathematical, or theoretical, statistics. This meaning, though too limited, gains in appropriateness from the fact that many of the great modern advances in statistical theory have in fact been developed in the study of biological material. (b) A more comprehensive meaning would include not only the mathematical theory but the experimental technique of, and the results obtained by, the application of quantitative methods in biology. This meaning will be adopted with the reservation that, since biological researches of all kinds, as they become more exact and detailed, must also become quantitative in character, we may ignore the large class of researches into living or organic materials, by quantitative physical or chemical methods, which do not involve the peculiar characteristics of populations, as opposed to individuals, or require statistical methods in their interpretation. (c) The third meaning is of historical interest only. In the early years of the 20th century the term was applied to the work of a group of investigators who held that heredity could better be studied by mass observations and correlation coefficients than by Mendelian analysis, by means of the frequency ratios obtained by experimental breeding. The progress of research has cleared away the causes of this controversy, and Mendel is now recognized as a pioneer in the introduction of statistical methods in biology.
In treating so large a subject in a limited space it will be necessary to omit on the one hand the advanced mathematical development of statistical theory, and on the other the description of the innumerable practical applications of the systematic measurement of living things, ranging as these do from the mass production of clothing to the practical improvement of livestock. Attention will therefore be concentrated upon illustrating the fundamental arguments and methods of procedure by which progress has actually been achieved.
The primary purpose of biometrical methods is to overcome the obstacles to exact reasoning which arise from the variability of biological material. This is invariably accomplished by studying the frequency of occurrence of the different possible forms, or of the different possible types of response to treatment, etc. This method is at its simplest when there are only a few or even two possibilities to be enumerated, as when live births are classified as those of male or female children. The statement that 51% of such births are of males is thus of the biometrical type at its simplest, in that it expresses the frequency ratio of one possibility of a variable event. Mendel's discovery of the laws of inheritance was due to the fact that in matings from which the offspring could be of two or more distinct kinds, he took the revolutionary step of ascertaining from a sufficiently large count just what the frequency ratios actually were. In this way he found the simple ratios 1 : 1 and 3 : 1 characteristic of differences dependent upon only a single factor, in addition to the more complex ratios appropriate to two or more independent factors. The Mendelian method of studying heredity lay in the experimental determination and interpretation of frequency ratios, and it is noticeable that each great advance from Mendel's position has been achieved by the same method.
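The reduction of a count to a frequency ratio may be sketched as follows. The counts used are Mendel's published figures for round and wrinkled seeds (5,474 and 1,850), which are not quoted in this article and are brought in here only as an assumption for illustration; the function name `frequency_ratio` is likewise hypothetical.

```python
def frequency_ratio(count_a, count_b):
    """Return (proportion of class A, ratio of A to B) for a two-class count."""
    total = count_a + count_b
    proportion_a = count_a / total
    ratio_a_to_b = count_a / count_b
    return proportion_a, ratio_a_to_b


# Assumption: counts from Mendel's seed-shape experiment, not given in this article.
round_seeds, wrinkled_seeds = 5474, 1850

proportion, ratio = frequency_ratio(round_seeds, wrinkled_seeds)
print(f"proportion round = {proportion:.4f}; ratio = {ratio:.2f} : 1")
```

A ratio close to 3 : 1 in so large a count is the kind of evidence on which the single-factor interpretation rests.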
More often the biological variation observed cannot be described in terms of a few distinct classes. A quantity (technically a variate) such as human stature can take any exact quantitative value in a considerable range of variation. Such measurements may be grouped in artificial classes; e.g., all statures from 65½ to 66½ in. may be considered as one class. The frequency of occurrence in each such class may be observed, and the frequency distribution so obtained affords an adequate description of the particular variate in question. For example, the table below gives the measurement of chest girth for 1,126 recruits aged 18, obtained by No. medical board at Liverpool:
The recognition of the value of such frequency distributions was principally due to Francis Galton. They are now invariably employed as a first step in the study of any biological phenomenon showing so-called continuous variation. The kind of information they provide may be seen at once by constructing a frequency histogram, in which the different measurements in the range of variation are indicated on a horizontal scale, and the classes are represented by rectangles, the areas of which are proportional to the number of individuals in each class.
The histogram illustrates graphically the high frequency with which measurements are recorded in the central classes of 33 to 35 in., and the increasing rarity with which the more extreme measurements occur, whether of very large or of very narrow chests. It is easy to see that the most frequently occurring chest girth (the mode) will be nearly the average or mean girth, and that these will both be near a third value, the median, which divides the population into two equal portions, half being larger than the median and half smaller. Further, the form of the histogram evidently gives a good idea of how variable the population is, for the same area might have been more concentrated than it is in the central values, and spaced over a smaller number of classes, representing a less variable population. Or, on the contrary, it might have had a lower central hump, and be spaced over a wider range, if the population had been more variable.
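The construction of a frequency distribution, and the comparison of mode, mean and median, may be sketched as below. The twelve girths are invented for illustration and are not the Liverpool measurements; the variable names are likewise assumptions.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical chest girths in inches, recorded to the nearest inch.
# These values are invented; they are not the data of the article's table.
girths = [31, 32, 33, 33, 34, 34, 34, 34, 35, 35, 36, 37]

# Frequency distribution: the number of individuals in each one-inch class.
frequency = Counter(girths)

# A crude text histogram: with classes of equal width, the length of each
# bar is proportional to the number of individuals in the class.
for girth in sorted(frequency):
    print(f"{girth:>3} in. | {'#' * frequency[girth]}")

mode_girth = max(frequency, key=frequency.get)  # most frequent class
mean_girth = mean(girths)                       # arithmetic average
median_girth = median(girths)                   # divides the sample in half

print("mode:", mode_girth, "mean:", mean_girth, "median:", median_girth)
```

In a symmetrical sample of this kind the three values coincide; in a skew distribution they would separate.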
The statistical treatment of frequency distributions is much facilitated by the fact that, in a large number of cases, the observed distributions conform, at least to a good approximation, to a definite mathematical form known as the normal distribution. This is specified by the law that the logarithm of the frequency in an infinitesimal range of the variate is a quadratic function of the variate itself. The variable part of this function may be written as $-\frac{(x - m)^2}{2\sigma^2}$, in which $m$ designates the central point, or mean, of the symmetrical distribution, and $\sigma^2$ designates the variance of the distribution, and provides a quantitative measure of the amount of variation present; its square root, $\sigma$, is called the standard deviation of the distribution.
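That the logarithm of the normal frequency is a quadratic function of the variate may be checked numerically, as in the following sketch. The parameter values and the function name `log_frequency` are assumptions chosen for illustration; the constant term is the usual normalizing factor of the normal density, which the article does not write out.

```python
import math


def log_frequency(x, m, sigma):
    """Logarithm of the normal density at x, for mean m and standard deviation sigma."""
    constant_part = -math.log(sigma * math.sqrt(2.0 * math.pi))
    variable_part = -((x - m) ** 2) / (2.0 * sigma ** 2)
    return constant_part + variable_part


# Illustrative parameter values (assumptions, not taken from the article).
m, sigma = 34.0, 1.5

# For a quadratic function the second differences at unit spacing are constant;
# here each should equal -1 / sigma**2.
xs = [31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0]
logs = [log_frequency(x, m, sigma) for x in xs]
second_differences = [
    logs[i - 1] - 2.0 * logs[i] + logs[i + 1] for i in range(1, len(logs) - 1)
]
print(second_differences)
```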
From any sample of observations it is important, therefore, to obtain estimates of the two parameters $m$ and $\sigma$ which specify the population sampled; such an estimate has been termed a statistic. It has been demonstrated that the best obtainable estimate of the quantity $m$ is found by calculating the arithmetic mean of the observations, written $\bar{x} = \frac{1}{n} S(x)$, where $n$ is the number of observations in the sample and $S$ denotes summation over all the observations. The best estimate of the variance may then be written $s^2 = \frac{1}{n - 1} S(x - \bar{x})^2$. When these two statistics have been calculated, the corresponding curve of frequency distribution may be constructed, as is shown in the figure by the dotted curve superimposed on the histogram. It will be seen that this process has removed the two arbitrary and fortuitous elements in the original representation: (a) the discontinuities, introduced by our arbitrary choice of units of measurement and grouping, are replaced by a curve showing continuous variation of frequency; (b) variations, due to the chances of random sampling, of the numbers in the different classes, are obliterated except in so far as the estimates of $m$ and $\sigma$ have been influenced by these errors of random sampling.
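The calculation of the two statistics may be sketched as follows, taking the divisor of the variance estimate to be n − 1. The sample values and the function name `estimate_mean_and_variance` are hypothetical.

```python
import math


def estimate_mean_and_variance(observations):
    """Return (x_bar, s_squared): the arithmetic mean and the variance
    estimate with divisor n - 1."""
    n = len(observations)
    if n < 2:
        raise ValueError("at least two observations are needed")
    x_bar = sum(observations) / n
    s_squared = sum((x - x_bar) ** 2 for x in observations) / (n - 1)
    return x_bar, s_squared


# Hypothetical sample of six girths in inches (invented for illustration).
sample = [33, 35, 34, 36, 32, 34]

x_bar, s_squared = estimate_mean_and_variance(sample)
s = math.sqrt(s_squared)  # estimated standard deviation
print(f"mean = {x_bar}, variance = {s_squared}, standard deviation = {s:.4f}")
```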
The errors to which statistical estimates are liable form the branch of the subject to which the greatest amount of attention has been given in recent years. For samples from the normal distribution it is known that the error distribution of the mean is itself normal, with variance equal to $\sigma^2/n$; the distribution of $s$ (the standard deviation estimated from the sample) is not normal, but is known with exactitude, and for large samples it approximates to the normal form with variance equal to $\sigma^2/(2n)$. Using our estimate $s$ for the value of $\sigma$, these variances and the corresponding standard errors may be regarded as known; thus having a mean chest girth of 34.009 in., with an estimated standard error of 0.0443 in., we have reason for some confidence that the mean girth of the population sampled lies between the limits 33.920 in. and 34.098 in., which stand at twice the standard error on either side of the mean. The result may be expressed otherwise by saying that the mean chest girth is significantly greater than 33.92 in., i.e., it exceeds this quantity by an amount which cannot readily be ascribed to chance.
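The arithmetic of these limits may be verified with the short sketch below, which takes the limits to lie at twice the standard error on either side of the mean. The figures 34.009, 0.0443 and 1,126 are those quoted in the text; the function names are hypothetical, and the implied standard deviation is a back-calculation, not a value stated in the article.

```python
import math


def standard_error_of_mean(s, n):
    """Standard error of the mean: square root of s**2 / n."""
    return s / math.sqrt(n)


def standard_error_of_s(s, n):
    """Large-sample standard error of s: square root of s**2 / (2 n)."""
    return s / math.sqrt(2 * n)


def two_standard_error_limits(estimate, standard_error):
    """Limits at twice the standard error on either side of the estimate."""
    return estimate - 2 * standard_error, estimate + 2 * standard_error


mean_girth = 34.009  # inches, as quoted in the text
se_mean = 0.0443     # inches, as quoted in the text
n = 1126             # number of recruits

lower, upper = two_standard_error_limits(mean_girth, se_mean)
print(f"limits: {lower:.3f} in. to {upper:.3f} in.")

# Back-calculated standard deviation implied by the quoted standard error.
s_implied = se_mean * math.sqrt(n)
print(f"implied s = {s_implied:.3f} in.; its standard error = "
      f"{standard_error_of_s(s_implied, n):.4f} in.")
```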
It will be noted that although the measurements are actually taken only to the nearest inch it is improbable that the estimate of the mean will be in error by as much as of an inch, and its standard error may be diminished without limit by taking larger samples. This important fact depends on the positive and negative errors of measurement neutralizing each other more and more exactly as the number in the sample is increased; it would not be true if the graduations of the tape measure used were in error. Similarly, in estimating the variance of the population, an allowance (Sheppard's correction) is usually made for the variance introduced by taking the measurements only to the nearest inch; if, however, owing to careless measuring a proportion of cases are not measured truly to the nearest inch, this proportion being greatest when the true measurement is near a class boundary, a part of the variation observed will be really due to these additional errors of measurement. In order to be sure that these errors are sufficiently rare the precaution should be taken of obtaining duplicate measurements, on different occasions, of a number of individuals.
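Sheppard's correction, in its usual form, subtracts one-twelfth of the square of the class width from the variance calculated from grouped measurements. The article does not give the formula, so the sketch below rests on that standard statement of it; the raw variance and the function name are hypothetical.

```python
def sheppard_corrected_variance(raw_variance, class_width):
    """Apply Sheppard's correction: subtract class_width**2 / 12 from a
    variance computed from measurements grouped into equal classes."""
    return raw_variance - class_width ** 2 / 12.0


# Hypothetical variance (square inches) from girths taken to the nearest inch.
raw_variance = 2.21

corrected = sheppard_corrected_variance(raw_variance, class_width=1.0)
print(f"raw = {raw_variance:.4f}, corrected = {corrected:.4f}")
```

With one-inch classes the allowance is small, about 0.083 square inches, but it is systematic and does not diminish as the sample grows.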
The methods outlined above find a very wide application in the study of the correlation or covariation of two or more variates. For this purpose a two-way frequency distribution is used which expresses, as the result of direct enumeration, how frequently both of the variates shall simultaneously have values between assigned limits. The utility of this method lies in the great choice which exists in the pair of variates chosen. Thus, if we have the height and weight of each of a number of individuals, a two-way distribution will show how frequently an individual chosen at random will have any given combination of height and weight; how frequently an individual of given weight will have an assigned height; and how frequently an individual of given height will have an assigned weight. Definite mathematical relations (regression equations) will be found to express the average weight of persons of a given height in terms of that height, or the average height of persons of a given weight in terms of that weight. Finally, it is sometimes useful to evaluate an abstract number, the correlation coefficient, which measures on an arbitrary quantitative scale the closeness of the interrelation between the two variates.
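The two regression coefficients and the correlation coefficient may be computed from paired observations as in the sketch below, which assumes linear regressions fitted by least squares. The heights and weights are invented for illustration, and the function name is hypothetical.

```python
import math


def regression_and_correlation(xs, ys):
    """Return (slope of y on x, slope of x on y, correlation coefficient)
    for paired observations, using sums of squares and products about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope_y_on_x = sxy / sxx  # average change in y per unit of x
    slope_x_on_y = sxy / syy  # average change in x per unit of y
    r = sxy / math.sqrt(sxx * syy)
    return slope_y_on_x, slope_x_on_y, r


# Hypothetical heights (in.) and weights (lb.) of six individuals.
heights = [64, 66, 67, 68, 70, 71]
weights = [130, 140, 150, 148, 165, 170]

b_weight_on_height, b_height_on_weight, r = regression_and_correlation(heights, weights)
print(f"weight on height: {b_weight_on_height:.3f} lb. per in.")
print(f"height on weight: {b_height_on_weight:.4f} in. per lb.")
print(f"correlation coefficient: {r:.4f}")
```

The square of the correlation coefficient equals the product of the two regression slopes, which is one way of seeing why it is an abstract number independent of the units of measurement.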
Again, the two variates may be similar measurements of related persons, such as are the heights of parent and child. The regression equation, which expresses the average height of the child in terms of the height of the chosen parent, is then of great importance, for it provides a direct measure of the efficacy of selection, natural or artificial, in modifying the average character of the population. For human physical measurements the simple rule is found to hold that, if one parent only is selected to be one unit above the average, then the next generation will have advanced by half a unit; but if both parents are selected the advance is approximately s of a unit. Similar biometrical studies are increasingly important in the improvement of livestock. In cases of inheritance the correlation coefficient may be used as a measure of the heritability of the character in question, as when it is found that certain mental and moral qualities are associated in near relatives to just the same extent as are physical measurements. Alternatively it may be taken to measure the closeness of relationship, as when it is shown that twins of like sex are more closely alike than are ordinary brothers or sisters.
The study of simultaneous distributions has opened out an immense field of research by providing a precise and objective method of studying vague and ill-understood influences, for example, meteorological and sociological. It is essentially a pioneer method, and is usually replaced as soon as exact knowledge is available of the causes at work. Pairs of values may be associated for innumerable reasons, but the first step, which the two-way table provides, is to find if they are or are not in fact associated.
In most cases there is a wide choice in the statistics which might be calculated as estimates of the parameters characteristic of the population. These will differ greatly in the efficiency with which they utilize the information supplied by the data. Methods are, however, available for obtaining in any particular case statistics which shall be efficient in this respect.
As further exact solutions are obtained of the error distribution of different statistics, so accurate tests of significance appropriate to the different problems which arise in practice are being developed. For a single normally distributed variate it is possible to test accurately whether the mean and variance do or do not differ significantly from given hypothetical values, as also whether two means or two variances obtained from observations are or are not in agreement. For the simultaneous distributions of two or more variates the theory of the regression coefficients and of the coefficient of correlation, in terms of which the dependence of one variate upon another may be expressed, is now almost equally complete. In addition, tests of goodness of fit are available for comparing the frequency of occurrence of the different classes with the frequencies expected by hypothesis, and for testing the adequacy of regression formulae, linear or non-linear, involving one or more independent variates. These tests may all be developed in the form known as the analysis of variance. Although much further mathematical research is required before the practical needs of biologists in this field will be fully met, the methods already established are adequate for a very large number of purposes, not the least of which is the increased precision and efficiency in the design of biological experiments.
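One familiar test of goodness of fit, the chi-square test, may be sketched as follows; the article does not name a particular test, so the choice is an assumption. The observed counts are Mendel's published figures for round and wrinkled seeds, which are not quoted in this article, and the function name is hypothetical. The value 3.841 is the tabulated 5% point of chi-square for one degree of freedom.

```python
def chi_square_statistic(observed, expected_proportions):
    """Sum over classes of (observed - expected)**2 / expected, where the
    expected numbers follow the hypothetical proportions."""
    total = sum(observed)
    expected = [total * p for p in expected_proportions]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Assumption: counts from Mendel's seed-shape experiment, not given in this article.
observed = [5474, 1850]

# Hypothesis to be tested: a 3 : 1 ratio between the two classes.
chi_square = chi_square_statistic(observed, [0.75, 0.25])

CRITICAL_5_PERCENT_1_DF = 3.841  # tabulated 5% point for one degree of freedom
agrees = chi_square < CRITICAL_5_PERCENT_1_DF
print(f"chi-square = {chi_square:.4f}; agrees with 3 : 1 at the 5% level: {agrees}")
```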
See T. L. Kelley, Statistical Method (1923), principally for psychologists; R. A. Fisher, Statistical Methods for Research Workers (1925), primarily for biologists, with special attention to exact tests of significance and the treatment of small samples; R. Pearl, Introduction to Medical Biometry and Statistics (1923); G. U. Yule, An Introduction to the Theory of Statistics (1927); Recommendations for the Taking and Presentation of Biological Measurements (British Association, 1927), practical recommendations of a joint committee of biologists and statisticians; Biometrika, a journal for the statistical study of biological problems, ed. K. Pearson. (R. A. F.)