TECHNIQUE NUMBER 12
BASIC STATISTICAL TECHNIQUES
Any numerical values
The major reason for numerical approaches to geologic or geomorphic problems is that they let you state results together with an estimate of their uncertainty. It is much better, for example, to be able to state that "glacial equilibrium-line altitudes (ELAs) during the last glaciation were 450 ± 50 m lower than present in Montana, but 1000 ± 100 m lower in Colorado" than to say that during glacial times "glaciers persisted to lower altitudes than at present throughout the Rocky Mountains". Statistics is often described as "the language of science" because it enables the testing of hypotheses. Were glacial-age ELAs lowered significantly more in the central than in the northern Rocky Mountains? Yes! The following is a brief, perhaps over-simplified summary of some potentially useful statistical tools.
Mode. The mode is the most common measurement. It is a strong ("robust") statistic which can be generated for any type of observation (colors, size classes, frequencies, etc.). A distribution may have one mode (unimodal), two modes (bimodal) or more (polymodal).
Median. The median is the central value - half of the measurements are higher, half are lower. Clearly, a median cannot be defined for colors, which have no "high" or "low", but it would work for integer data or any other ordinal (ranked) data or rational numbers.
Mean. The mean is the average value - calculated by summing all values and dividing by the number of measurements. Although an average can be calculated for any set of numbers, a mean is strictly valid only for real numbers and assumes a normal distribution ("bell-shaped curve"). NOTE: if the distribution is normal, the mode, median, and mean will be very similar.
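As a minimal sketch of the three measures of central tendency described above, the following Python snippet uses the standard-library statistics module. The two data sets are hypothetical values invented for illustration (they are not measurements from this handout), and the helper name central_tendency is likewise an assumption.

```python
import statistics

# Hypothetical ELA-lowering values in metres (illustrative, not measured data).
symmetric = [400, 450, 450, 500, 500, 500, 550, 550, 600]
# Hypothetical positively skewed sample, e.g. clast sizes in cm.
skewed = [1, 2, 2, 2, 3, 4, 8, 20, 48]

def central_tendency(values):
    """Return (modes, median, mean) for a list of numbers."""
    modes = statistics.multimode(values)  # a list, so bimodal data stays visible
    median = statistics.median(values)    # central value of the sorted data
    mean = statistics.mean(values)        # sum of values divided by their count
    return modes, median, mean

print("symmetric:", central_tendency(symmetric))
print("skewed:   ", central_tendency(skewed))
```

For the symmetric sample the mode, median, and mean coincide; for the skewed sample the mean is pulled well above the median and mode by the few large values.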
Range. For ordinal or higher data, a common indicator of dispersion is the highest minus lowest data values - the range. At the very least, if numerical values are measured, the range should be reported regardless of the shape of the distribution.
Standard Deviation. For normally-distributed data, the central tendency (mean) should be accompanied by a measure of the dispersion of individual values - the standard deviation. The standard deviation is estimated as the square root of the average of the squared differences between individual measurements and the mean (UGH!). If you are using a spreadsheet or mathematics calculator, the standard deviation can readily be calculated for you. If not, you can rapidly estimate the standard deviation as the distance on either side of the mean which includes about 2/3 of the data. That is, the mean ± 1 s.d. includes about 2/3 of the observations. The mean ± 2 s.d. should include about 95% of the observations.
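The calculation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the data are hypothetical, the function names are invented here, and the "average of the squared differences" is taken literally (dividing by n, which matches statistics.pstdev). Many spreadsheets and statistics.stdev divide by n - 1 instead, which gives a slightly larger value for small samples.

```python
import math
import statistics

# Hypothetical ELA-lowering values in metres (illustrative, not measured data).
values = [400, 450, 450, 500, 500, 500, 550, 550, 600]

def std_dev_by_hand(data):
    """Square root of the average squared difference from the mean (divides by n)."""
    mean = sum(data) / len(data)
    squared_diffs = [(x - mean) ** 2 for x in data]
    return math.sqrt(sum(squared_diffs) / len(data))

def fraction_within(data, n_sd):
    """Fraction of observations lying within mean +/- n_sd standard deviations."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mean) <= n_sd * sd]
    return len(inside) / len(data)

print("s.d. by hand:     ", round(std_dev_by_hand(values), 1))
print("s.d. from library:", round(statistics.pstdev(values), 1))
print("within 1 s.d.:    ", round(fraction_within(values, 1), 2))
print("within 2 s.d.:    ", round(fraction_within(values, 2), 2))
```

With only nine observations the fractions are coarse, so they only roughly match the 2/3 and 95% rules of thumb, which strictly apply to large, normally distributed data sets.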
Standard Error. You might be concerned less with the variability of the single measurements from which a mean was generated than with the confidence in the estimation of the mean itself (the "estimate"). In this case, the "standard error on the estimate" or "standard error on the mean", or simply the standard error, is the statistic of choice. The standard error is calculated by dividing the standard deviation by the square root of the number of measurements. Note that the standard deviation doesn't change significantly as the number of measurements increases, but the standard error decreases in inverse proportion to the square root of the number of measurements. In other words - taking more measurements doesn't reduce dispersion of individual measurements, but does increase your confidence in the central tendency.
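To make the contrast between standard deviation and standard error concrete, here is a small Python sketch. The measurements are hypothetical, the function name standard_error is an assumption, and the larger sample is simply the small one repeated four times so that its spread stays about the same. The sketch uses statistics.stdev, which divides by n - 1.

```python
import math
import statistics

def standard_error(values):
    """Sample standard deviation divided by the square root of n."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical measurements (illustrative only).
few = [2, 4, 4, 4, 5, 5, 7, 9]
# Assumption for illustration: four times as many measurements, same spread.
many = few * 4

for label, sample in (("n = 8 ", few), ("n = 32", many)):
    sd = statistics.stdev(sample)
    se = standard_error(sample)
    print(label, "s.d. =", round(sd, 2), "s.e. =", round(se, 2))
```

Quadrupling the number of measurements leaves the standard deviation nearly unchanged but cuts the standard error roughly in half, as expected from the square-root relationship.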
The use of mean, standard deviation, correlation, regression, and other Gaussian statistics assumes a normal distribution. In the natural sciences, data are often skewed; that is, most of the data are concentrated to one side of the mean, with a long tail on the other side. This is particularly noticeable for sizes in Geology, from sediment sizes to stream discharges. Such data can often be transformed to a normal distribution and then analyzed through Gaussian statistics. Transformation involves applying a standard mathematical operation such as squaring, taking the square root, or, most commonly, taking the logarithm (base e - "natural log", base 2 - as in sediment sizes, or base 10). Log10 transformation is often used for positively skewed data sets (a tail much longer above than below the mean) because it maintains the decimal focus of most science. For example: log10(1) = 0 (10^0 = 1), log10(10) = 1 (10^1 = 10), and log10(100) = 2 (10^2 = 100). The easiest way to work with transformations is to:
NOTE: