# June 20, 2020

## Data Measurement and Analysis for Software Engineers

—A look at measurement and its application in software engineering.

In Data Measurement and Analysis for Software Engineers, I go through Part 2 of a talk by Dennis J. Frailey. Part 2 looks at basic analysis techniques for software engineering and what makes a good measurement process. In this article, I look at Part 3 of the webinar. Part 3 covers statistical techniques for data analysis.

Perhaps the most valuable information in Part 3 is the discussion of how to deal with normal and non-normal sets of data. Most software metrics produce data that is not normally distributed, yet most common statistical techniques assume a normal distribution. Misapplying those techniques leads to a variety of errors.

In what follows, I highlight the most important elements from the webinar including:

• Statistics for normal distributions of data including:
  • Central Tendency Measures,
  • Measures of Dispersion and
  • Confidence Intervals and Confidence Levels.
• Challenges involving software metrics including:
  • Types of Studies,
  • Hypothesis Testing and
  • Exploring Relationships.
• Choosing an Evaluation Technique including Techniques for Non-Normal Distributed Data.

## Normally Distributed Statistics

Software metrics may not contain normally distributed data. This section describes the statistics that assume a normal distribution, along with caveats for identifying when not to use them.

### Central Tendency Measures

A measure of central tendency is

a single value that attempts to describe a set of data by identifying the central value within that set of data.

These are the

• mean (the average of all data values),
• median (the middle data value) and
• mode (the most frequent data value).

If these values are all equal, it suggests, but does not guarantee, that the set of data has a normal distribution. The potential for skew makes it critically important to pay attention to the distribution of the data.

If the values are positively skewed then typically $mode < median < mean.$ If negatively skewed then typically $mean < median < mode.$ If positive or negative skew is present, the mean will not identify the middle value. The median will.
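As a minimal sketch, using hypothetical defect counts that I made up for illustration, Python's `statistics` module shows how a long right tail pulls the mean away from the median and the mode:

```python
from statistics import mean, median, mode

# Hypothetical defect counts per module (illustrative data, not from the webinar).
# A few large values form a long right tail, i.e. a positive skew.
defects = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]

skew_mode = mode(defects)      # most frequent value
skew_median = median(defects)  # middle value of the sorted data
skew_mean = mean(defects)      # arithmetic average, pulled toward the tail

print("mode  :", skew_mode)
print("median:", skew_median)
print("mean  :", skew_mean)
```

Here the two largest values account for more than half of the total, so the mean lands well above the value that sits in the middle of the data.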

The mode can be

• undefined (all values occur equally),
• bimodal (there are two most frequent values) and
• multimodal (there are multiple most frequent values).

Bimodal and multimodal data sets occur whenever there are several local maximums.
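The three cases can be sketched with `statistics.multimode`. The helper name `describe_modes` is my own. It only detects ties for the single highest frequency, so it is a simplification: it will not find local maximums in a histogram.

```python
from statistics import multimode

def describe_modes(values):
    """Classify a data set by how many values tie for most frequent.

    Hypothetical helper for illustration only. It looks at exact ties
    for the highest frequency, not at local peaks in a histogram.
    """
    modes = multimode(values)
    distinct = len(set(values))
    if len(modes) == distinct and distinct > 1:
        return "undefined", modes   # every value occurs equally often
    if len(modes) == 1:
        return "unimodal", modes
    if len(modes) == 2:
        return "bimodal", modes
    return "multimodal", modes

print(describe_modes([1, 2, 3, 4]))
print(describe_modes([1, 1, 2, 3, 3]))
print(describe_modes([1, 1, 2, 2, 3, 3, 4]))
```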

The mode can be computed for any scale (i.e., nominal, ordinal, interval, ratio and absolute scales). The median requires at least an ordinal scale; the mean requires at least a ratio scale. Most software metrics rely upon ordinal scales, so only the median and mode are applicable.

### Measures of Dispersion

A measure of dispersion is

a single value that attempts to describe a set of data by identifying the spread of values within that set of data.

The most common measures of dispersion are

• the range. It is the distance between the largest and smallest values in the set of data. It requires an interval scale to compute.
• the variance. It is a measure of how far the numbers are spread out. It is particularly useful when looking at probability distributions. A variance of zero means the numbers are all the same; a small variance means the numbers are close together and a large variance means they are not.
• the standard deviation. It is the square root of the variance and represents the amount of variation or dispersion in a set of data, expressed in the same units as the data. In a normally distributed set of data, about $68\%$ of the values lie within one standard deviation of the mean and about $95\%$ lie within two.

Variance and standard deviation both require a ratio scale.
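A small sketch of the three measures follows. The build times are hypothetical, and I assume the data is the whole population, hence `pvariance` and `pstdev`. For a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` would be the usual choice.

```python
from statistics import pstdev, pvariance

# Hypothetical build times in minutes (illustrative data only).
build_minutes = [12, 15, 11, 14, 18]

value_range = max(build_minutes) - min(build_minutes)  # largest minus smallest
variance = pvariance(build_minutes)   # mean squared distance from the mean
std_dev = pstdev(build_minutes)       # square root of the variance

print("range   :", value_range)
print("variance:", variance)
print("std dev :", round(std_dev, 3))

# A variance of zero means every value is the same.
print("constant:", pvariance([7, 7, 7, 7]))
```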

### Confidence Intervals and Confidence Levels

A confidence interval is

a range of values that have a strong chance of containing the value of an unknown parameter of interest.

A confidence level is

a probability that a value will fall within the confidence interval.

A margin of error is

normally half the confidence interval.

Using continuous functions to approximate data requires understanding confidence intervals and levels.

Conclusions are generally considered valid only if the confidence level is high (typically $95\%$). To be useful, the confidence interval at that level should also be narrow. A wide interval means the data includes a wide range of values. A wide range means variance is high. High variance is often a problem with software metrics.
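To make the relationship between interval, level and margin of error concrete, here is a rough sketch of a confidence interval for a mean. It assumes the sample is roughly normal and uses the normal approximation; for small samples a t-distribution would be more appropriate. The function name and the review-time data are my own, not the webinar's.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Return (low, high, margin) for the mean, using a normal approximation."""
    center = mean(sample)
    # Two-sided critical value: about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(sample) / sqrt(len(sample))
    return center - margin, center + margin, margin

# Hypothetical code review times in hours (illustrative data only).
review_hours = [4.0, 5.5, 3.5, 6.0, 5.0, 4.5, 5.0, 6.5]

low, high, margin = mean_confidence_interval(review_hours)
print(f"95% interval: {low:.2f} to {high:.2f} (margin of error {margin:.2f})")

# The same mean with more dispersion gives a wider interval.
spread_hours = [1.0, 9.0, 2.0, 8.0, 5.0, 5.0, 0.5, 9.5]
wide_low, wide_high, wide_margin = mean_confidence_interval(spread_hours)
print(f"95% interval: {wide_low:.2f} to {wide_high:.2f} (margin of error {wide_margin:.2f})")
```

The margin of error is half the width of the interval, and the more dispersed sample produces the wider interval even though both samples have the same mean.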

## Software Metrics

A challenge with metrics is the presence of uncontrolled variables. They require that information be evaluated using

• available information or
• software engineering studies.

A challenge with available information is that it is often just opinion. In the worst case, it is exaggeration. A challenge with software engineering studies is that they are not easily reproduced or are reproduced only a small number of times.

Don’t draw conclusions using poorly designed studies.

### Types of Studies

A study looks at data. It provides evidence for a hypothesis but not proof. The purpose of a study is to

1. test or confirm a hypothesis or theory.
2. explore a relationship.

By comparison, an experiment generates data. To properly establish cause and effect, an experiment must be repeated independently many times. If the hypothesis can be independently verified by many experiments, then it becomes a theory. A theory is a possible explanation for observed behaviour that accurately predicts the phenomena.

### Hypothesis Testing

A hypothesis is a possible explanation of an observed phenomenon. The goals of a hypothesis are to explain a phenomenon and to predict the consequences. To test a hypothesis, conduct an experiment to evaluate cause and effect.

A hypothesis speculates that given certain values of some independent variables we can predict the values of dependent variables. Independent variables are controlled (changed) to study their effect on dependent variables.

In hypothesis testing there may be

• many variables,
• variables that are not controllable or
• confounding effects between variables.

Examples of confounding variables include the environment, tools, processes, capabilities of people and application domain.

Uncontrolled and confounding variables may affect the outcome of any study or measurement. The more of these variables the less reliable the results.

We can evaluate evidence by asking

• does the evidence support the hypothesis?
• how likely is it that the evidence supports the hypothesis by chance?

A test of significance is a measure of how likely it is that the results are due to chance. In classical statistics, significance means there is a $95\%$ confidence level that the hypothesis is true. This confidence level is difficult to show if the data is widely dispersed.

To test or confirm a hypothesis, use an analysis of variance (ANOVA):

• test two or more methods, tools, etc.
• use statistical tests to determine the differences and see if they are statistically significant.

If the differences are statistically significant, the hypothesis might be right: there is empirical evidence for the difference. If not, then the hypothesis is probably wrong.

A statistically significant result provides support for a hypothesis. It does not prove it.

Errors that might show statistically significant results, even when the hypothesis is incorrect, include

• experimental errors.
• the presence of uncontrolled variables.
• the presence of confounding variables.
• use of an invalid statistical technique.

Statistical techniques commonly used for ANOVA include

• Student’s T Test,
• F Statistics and
• Kruskal-Wallis and other advanced techniques.
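As an illustration of the F statistic, the sketch below computes a one-way ANOVA by hand for three hypothetical tools. The function name and data are mine. It stops at the F value: turning that into a significance level needs the F distribution with $(k - 1, N - k)$ degrees of freedom, which a library such as SciPy provides through `scipy.stats.f_oneway`. The calculation also assumes roughly normal data with similar variance in each group.

```python
from statistics import mean

def one_way_anova_f(*groups):
    """Return the one-way ANOVA F statistic for two or more groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    group_means = [mean(g) for g in groups]
    grand_mean = mean([x for g in groups for x in g])

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical task completion times in minutes for three tools.
tool_a = [12, 14, 11, 13]
tool_b = [15, 17, 16, 18]
tool_c = [12, 13, 12, 11]

f_value = one_way_anova_f(tool_a, tool_b, tool_c)
print(f"F = {f_value:.2f}")
```

A large F means the differences between the group means are large relative to the spread inside each group. Whether it is large enough still depends on the degrees of freedom and the chosen significance level.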

### Exploring Relationships

Exploring relationships is best done using robust statistics. Robust statistics include graphs, charts and diagrams. Approaches used include

• box plots (summarize the range and distribution of data for a single variable),
• bar charts (compare a small number of entities),
• control charts (show trends and abnormalities over time),
• scatter diagrams (show the relationship between two variables) and
• correlation analysis (statistical methods to supplement scatter diagrams).
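For correlation analysis, a hand-rolled Pearson coefficient is enough to show the idea. This is a sketch with made-up data. Pearson's coefficient assumes a roughly linear relationship and at least interval-scale data, so for ordinal data a rank correlation such as Spearman's is the usual substitute.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical module sizes (lines of code) and defects found per module.
module_size = [100, 200, 300, 400, 500]
defects_found = [2, 4, 5, 4, 5]

r = pearson_r(module_size, defects_found)
print(f"r = {r:.3f}")
```

A value near $1$ or $-1$ indicates a strong linear relationship and a value near $0$ indicates little or none. Correlation alone says nothing about cause and effect.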

## Choosing an Evaluation Technique

The major considerations when choosing an evaluation technique are

• the nature of the data. This includes consideration of
  • the relationship of the set of data to the larger population.
  • the distribution of the set of data. Many statistical techniques assume normally distributed data.
  • the scale of the set of data.
• the purpose of the study. This could be to confirm a theory or hypothesis or to explore a relationship.
• the study design. Identification and use of the best techniques to support the purpose of the study.

The sample size is important because many statistical techniques compare the sample size to the population size to suggest a confidence interval. Another consideration is whether the sample is truly a random selection. A larger sample size generally improves confidence. So does running multiple experiments.

Many statistical techniques assume

• the variables are independent.
• the observations are independent.
• there are no uncontrolled or confounding variables.
• all data is described using an interval or ratio scale.
• that the dependent variable is related to the independent variable by a linear function.
• independence of experiments (i.e., one experiment does not influence the other).
• outliers have no significant effect on the mean and median.

Good technique requires that you test your assumptions prior to using the technique.

Ways to contaminate normally distributed data include

• by contamination with bad data. Even 1-5 percent of bad data can render statistical techniques invalid.
• by mixing two or more normally distributed datasets that have different means, medians or modes.
• by small departures from the assumptions described above.
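The effect of a small amount of bad data is easy to demonstrate. In this sketch, with hypothetical response times, one value in twenty (5 percent) is recorded in the wrong unit. The mean moves a long way while the median does not move at all.

```python
from statistics import mean, median

# Hypothetical response times in seconds: 20 clean measurements.
clean = [19, 20, 21, 20] * 5

# One measurement (5% of the data) was logged in milliseconds by mistake.
contaminated = clean[:-1] + [2000]

print("clean        mean:", mean(clean), " median:", median(clean))
print("contaminated mean:", mean(contaminated), " median:", median(contaminated))
```

Any technique built on the mean inherits this sensitivity, which is one reason to inspect the data before choosing a technique.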

### Techniques for Non-Normal Distributed Data

For non-normal distributed data

• determine the underlying distribution and apply techniques suitable for that distribution.
• change the scales.
• use nonparametric statistical methods.

Nonparametric methods do not use parameterized families of probability distributions. They tend to be used on populations of ranked data and may not need mean, median or variance. (Data ranked using ordinal scales.)
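As one example of a rank-based method, here is a sketch of the Kruskal-Wallis H statistic computed by hand. The helper names and data are my own. It uses average ranks for ties but omits the tie correction factor. To judge significance, H would be compared against a chi-squared distribution with $k - 1$ degrees of freedom; SciPy's `scipy.stats.kruskal` does both steps.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (without tie correction) for two or more groups."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

# Hypothetical effort rankings (1 = least effort) for tasks done by three teams.
team_a = [1, 2, 3]
team_b = [4, 5, 6]
team_c = [7, 8, 9]

h_value = kruskal_wallis_h(team_a, team_b, team_c)
print(f"H = {h_value:.2f}")
```

Because only the ranks enter the calculation, the result does not change if the underlying values are stretched or compressed, which is what makes the method suitable for ordinal data.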

They are used for two main categories of statistics

• descriptive statistics. A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features of a collection of information. Descriptive statistics (in the mass noun sense) is the process of using and analyzing those statistics. It is solely concerned with properties of the observed data, and it does not rest on the assumption that the data come from a larger population.
• inferential statistics. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population.