A look at measurement and its application in software engineering.
In Data Measurement and Analysis for Software Engineers, I go through Part 2 of a talk by Dennis J. Frailey. Part 2 looks at basic analysis techniques for software engineering and what makes a good measurement process. In this article, I look at Part 3 of the webinar. Part 3 covers statistical techniques for data analysis.
Perhaps the most valuable information in Part 3 is the discussion on dealing with normal and non-normal sets of data. Most software metrics result in sets of data that do not follow a normal distribution. Most common statistical techniques assume a normal distribution. Misapplying statistical techniques leads to a variety of errors.
In what follows, I highlight the most important elements from the webinar including:
Software metrics may not contain normally distributed data, so each section below includes a description of the statistic and caveats for identifying when not to use it.
A measure of central tendency is
a single value that attempts to describe a set of data by identifying the central value within that set of data.
These are the mean, the median and the mode.
If these values are all equal, it suggests, but does not guarantee, that the set of data has a normal distribution. The potential for skew in data makes it critically important to pay attention to the distribution in the data.
If the values are positively skewed, then \(mean > median > mode\). If negatively skewed, then \(mean < median < mode\). If positive or negative skew is present, the mean will not identify the middle value. The median will.
The mode can be unimodal, bimodal or multimodal. Bimodal and multimodal data sets occur whenever there are several local maxima.
The mode can be computed for any scale (i.e., nominal, ordinal, interval, ratio and absolute scales). The median requires at least an ordinal scale; the mean requires at least an interval scale. Most software metrics rely upon ordinal scales, so only the median and mode are applicable.
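As a rough illustration, here is a minimal Python sketch using the standard library's `statistics` module on an invented set of per-module defect counts; it compares the three measures and checks for the positive-skew ordering described above.

```python
# Minimal sketch: central tendency on invented defect-count data.
from statistics import mean, median, mode

defects_per_module = [1, 2, 2, 2, 3, 4, 5, 9, 21]  # invented, right-skewed

m = mean(defects_per_module)     # about 5.44
md = median(defects_per_module)  # 3
mo = mode(defects_per_module)    # 2
print(f"mean={m:.2f} median={md} mode={mo}")

# Positive skew: mean > median > mode, so the median identifies the
# middle value better than the mean does.
if m > md > mo:
    print("positively skewed")
```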
A measure of dispersion is
a single value that attempts to describe a set of data by identifying the spread of values within that set of data.
The most common measures of dispersion are the range, the variance and the standard deviation.
Variance and standard deviation both require at least an interval scale.
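A similar sketch for the measures of dispersion, again with the standard library and invented ratio-scale data (hours per task):

```python
# Minimal sketch: dispersion on invented hours-per-task data.
from statistics import pvariance, pstdev

hours_per_task = [4.0, 5.5, 6.0, 6.5, 8.0, 12.0]  # invented

value_range = max(hours_per_task) - min(hours_per_task)
variance = pvariance(hours_per_task)   # population variance
std_dev = pstdev(hours_per_task)       # population standard deviation
print(f"range={value_range} variance={variance:.2f} stdev={std_dev:.2f}")
```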
A confidence interval is
a range of values that has a strong chance of containing the value of an unknown parameter of interest.
A confidence level is
a probability that a value will fall within the confidence interval.
A margin of error is
normally half the width of the confidence interval.
Using continuous functions to approximate data requires understanding confidence intervals and levels.
Conclusions are valid only if the confidence level is high (e.g., \(95\%\)) and the confidence interval is narrow. A wide interval means the data includes a wide range of values. A wide range means variance is high. High variance is often a problem with software metrics.
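A minimal sketch of a \(95\%\) confidence interval and margin of error for a mean, using the normal approximation (\(z \approx 1.96\)) and invented cycle-time data:

```python
# Minimal sketch: 95% confidence interval via the normal approximation.
from math import sqrt
from statistics import mean, stdev

cycle_times = [3.1, 4.2, 2.8, 5.0, 3.9, 4.4, 3.5, 4.1, 3.0, 4.6]  # invented

n = len(cycle_times)
m = mean(cycle_times)
std_err = stdev(cycle_times) / sqrt(n)  # standard error of the mean
margin = 1.96 * std_err                 # margin of error: half the interval
print(f"mean={m:.2f} 95% CI=[{m - margin:.2f}, {m + margin:.2f}]")
```

Note how widely dispersed data inflates the standard error and, with it, the width of the interval.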
A challenge with metrics is the presence of uncontrolled variables. They require that information be evaluated using sound statistical techniques.
A challenge with available information is that it is often just opinion. In the worst case, it is exaggeration. A challenge with software engineering studies is that they are not easily reproduced, or are reproduced only a small number of times.
Don’t draw conclusions using poorly designed studies.
A study looks at data. The purpose of a study is to gather evidence for a hypothesis, not to prove it.
By comparison, an experiment generates data. To properly establish cause and effect, an experiment must be repeated independently many times. If the hypothesis can be independently verified by many experiments, then it becomes a theory. A theory is a possible explanation for observed behaviour that accurately predicts the phenomena.
A hypothesis is a possible explanation of an observed phenomenon. The goals of a hypothesis are to explain a phenomenon and to predict the consequences. To test a hypothesis, conduct an experiment to evaluate cause and effect.
A hypothesis speculates that, given certain values of some independent variables, we can predict the values of dependent variables. Independent variables are controlled (changed) to study their effect on dependent variables.
In hypothesis testing there may be uncontrolled and confounding variables.
Examples of confounding variables include the environment, tools, processes, capabilities of people and application domain.
Uncontrolled and confounding variables may affect the outcome of any study or measurement. The more of these variables there are, the less reliable the results.
We can evaluate evidence by asking whether the results are statistically significant.
A test of significance is a measure of how likely it is that the results are due to chance. In classical statistics, significance means there is at least a \(95\%\) confidence level that the results are not due to chance. This confidence level is difficult to show if the data is widely dispersed.
To test or confirm a hypothesis, use an analysis of variance (ANOVA): compare the variation between sample groups with the variation within each group.
If the differences are statistically significant, the hypothesis might be right: there is empirical evidence for the difference. If not, then the hypothesis is probably wrong.
A statistically significant result provides support for a hypothesis. It does not prove it.
Errors that might show statistically significant results, even when the hypothesis is incorrect, include false positives (Type I errors), biased samples, and uncontrolled or confounding variables.
Statistical techniques commonly used for ANOVA include Student's t-test, the F-test and the chi-squared test.
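As a sketch of how this might look in practice, assuming SciPy is available, `scipy.stats.f_oneway` performs a one-way ANOVA; the three samples of review times below are invented.

```python
# Minimal sketch: one-way ANOVA with SciPy on invented review-time samples.
from scipy.stats import f_oneway

team_a = [12, 15, 14, 10, 13]
team_b = [22, 25, 24, 20, 23]
team_c = [13, 14, 16, 12, 15]

f_stat, p_value = f_oneway(team_a, team_b, team_c)
print(f"F={f_stat:.2f} p={p_value:.4f}")

# p < 0.05 suggests the between-team differences are statistically
# significant: evidence for the hypothesis, not proof of it.
if p_value < 0.05:
    print("differences are statistically significant")
```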
Exploring relationships is best done using robust statistics. Robust statistics include graphs, charts and diagrams. Approaches used include scatter plots, box plots and histograms, as sketched below.
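A minimal sketch of exploring a relationship visually, assuming matplotlib is available; the module sizes and defect counts are invented.

```python
# Minimal sketch: scatter plot and box plot on invented metric data.
import matplotlib.pyplot as plt

size_kloc = [1.2, 2.5, 3.1, 4.8, 5.0, 6.3, 7.7]  # invented module sizes
defects = [3, 5, 6, 11, 9, 14, 18]               # invented defect counts

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(size_kloc, defects)   # does defect count grow with size?
ax1.set_xlabel("size (KLOC)")
ax1.set_ylabel("defects")
ax2.boxplot(defects)              # spread and outliers at a glance
ax2.set_ylabel("defects")
fig.savefig("defects.png")
```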
The major considerations when choosing an evaluation technique are the sample size, the randomness of the sample and the assumptions the technique makes about the data.
The sample size is important because many statistical techniques compare the relative sample size to the population size to suggest a confidence interval. Another consideration is whether the sample is truly a random selection. A larger sample size generally improves confidence, as does running multiple experiments.
Many statistical techniques assume the data is normally distributed.
Good technique requires that you test your assumptions prior to using the technique.
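One common way to test the normality assumption is the Shapiro-Wilk test. A minimal sketch, assuming SciPy is available, on invented build-time data:

```python
# Minimal sketch: Shapiro-Wilk normality test on invented build times.
from scipy.stats import shapiro

build_times = [41, 43, 40, 45, 44, 42, 41, 90, 43, 42]  # note the outlier

w_stat, p_value = shapiro(build_times)
print(f"W={w_stat:.3f} p={p_value:.4f}")

# A small p-value (< 0.05) suggests the data is not normally distributed,
# so techniques that assume normality should not be applied to it.
if p_value < 0.05:
    print("data does not look normal; prefer nonparametric techniques")
```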
Ways to contaminate normally distributed data include mixing data from different populations and including outliers or measurement errors.
For non-normally distributed data, use nonparametric methods.
Nonparametric methods do not use parameterized families of probability distributions. They tend to be used on populations of ranked data and may not require a mean, median or variance. (Data is ranked using ordinal scales.)
They are used on two main categories of data: ranked data measured on ordinal scales, and data whose underlying distribution is unknown or not normal.
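As a closing sketch, assuming SciPy is available, the Mann-Whitney U test compares two samples by rank and makes no normality assumption; the ordinal severity rankings below are invented.

```python
# Minimal sketch: Mann-Whitney U test on invented ordinal severity ranks.
from scipy.stats import mannwhitneyu

severity_before = [3, 4, 4, 5, 3, 4, 5, 5]  # invented rankings (1-5 scale)
severity_after = [2, 3, 2, 3, 1, 2, 3, 2]

u_stat, p_value = mannwhitneyu(severity_before, severity_after)
print(f"U={u_stat} p={p_value:.4f}")
```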