In Data Measurement and Analysis for Software Engineers, I go through Part 3 of a talk by Dennis J. Frailey. Part 3 looks at statistical techniques for data analysis in software engineering and takes a look at normal and non-normal statistics. In this article, I look at Part 4 of the webinar. Part 4 covers statistical techniques for data analysis that are well suited for software.

Descriptive statistics are used to describe, show or summarize data in a meaningful, often visual, way. They are also known as robust statistics. They are not developed using probability theory but may rely on inferential statistics.

Descriptive statistics include measures of central tendency (median, mean and mode) and measures of variability (standard deviation, variance, minimum and maximum). Each of these applies only when it is meaningfully supported by the measurement scale of the data.
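
As a minimal sketch of these measures, the snippet below uses Python's standard `statistics` module. The `defects` list is hypothetical data invented for illustration, and it is assumed to be on a ratio scale so that every measure is meaningful.

```python
import statistics

# Hypothetical defect counts per module (assumed ratio scale); not from the webinar.
defects = [2, 3, 3, 5, 7, 8, 13]

summary = {
    # Central tendency
    "mean": statistics.mean(defects),
    "median": statistics.median(defects),
    "mode": statistics.mode(defects),
    # Variability (sample variance and sample standard deviation)
    "variance": statistics.variance(defects),
    "stdev": statistics.stdev(defects),
    "min": min(defects),
    "max": max(defects),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For ordinal data, only the median, mode, minimum and maximum from this summary would remain meaningful.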

Graphs are commonly used to enhance descriptions. They include:

- box plots
- bar charts and histograms
- control charts
- scatter charts and correlation

## Box Plots

Box plots are useful for showing the median and quartiles of quantitative data. They are useful whenever the data is not on a ratio scale or is not normally distributed. The data must be on an ordinal scale or higher. Box plots are good for showing normal and unusual conditions.

Box plots are used for comparisons. They are used on an irregular basis.
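
The numbers behind a box plot can be sketched with the standard library. The `review_hours` data is hypothetical, and the 1.5 × IQR rule for flagging unusual points is a common convention assumed here, not something stated in the webinar.

```python
import statistics

# Hypothetical code review durations in hours; invented for illustration.
review_hours = [1, 2, 2, 3, 3, 4, 4, 5, 20]

# Quartiles: the three cut points that split the sorted data into four parts.
q1, median, q3 = statistics.quantiles(review_hours, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are "unusual".
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in review_hours if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"unusual points: {outliers}")
```

Because the summary is built from ranks (median and quartiles) rather than the mean, the single 20-hour review does not distort the box itself.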

## Bar Charts and Histograms

A histogram plots the distribution of data, using bars to represent the frequency of values in each interval (bin) of the data. Order is important, so the bars cannot be reordered.

A bar chart is used to plot categorical data, and its bars can be reordered. Bar charts are used to compare values.

A Pareto chart is a bar chart where the bars are ordered from largest to smallest and a line shows the cumulative total. An interesting review of the statistical basis (and an improvement) is available in Revising the Pareto Chart.
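
The data behind a Pareto chart can be sketched as follows. The `defect_causes` categories and counts are hypothetical, chosen only to show the ordering and the cumulative line.

```python
from itertools import accumulate

# Hypothetical defect counts by cause; invented for illustration.
defect_causes = {
    "requirements": 12,
    "coding": 30,
    "design": 18,
    "configuration": 6,
    "documentation": 4,
}

# Bars: categories ordered from largest to smallest count.
ordered = sorted(defect_causes.items(), key=lambda item: item[1], reverse=True)

# Line: running total of the ordered counts, as a percentage of the grand total.
total = sum(count for _, count in ordered)
running = list(accumulate(count for _, count in ordered))
cumulative_pct = [100 * r / total for r in running]

for (cause, count), pct in zip(ordered, cumulative_pct):
    print(f"{cause:15} {count:3} {pct:6.1f}%")
```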

## Control Charts

A control chart is used to track the performance of a process or a machine. It shows actual data against the average, in comparison to the expected variation (the control limits). The average and expected variation are based upon previous data or projects.
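
A minimal sketch of the idea: derive the centre line and control limits from historical data, then check new observations against them. Both data sets are hypothetical, and placing the limits at three standard deviations from the mean is a common convention assumed here, not a value given in the webinar.

```python
import statistics

# Hypothetical defects found per build on previous projects (the baseline).
historical = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11]

# Hypothetical observations from the current project.
current = [11, 10, 15, 9]

# Centre line and expected variation come from the historical data.
center = statistics.mean(historical)
sigma = statistics.stdev(historical)

# Assumed convention: control limits at +/- 3 standard deviations.
upper_limit = center + 3 * sigma
lower_limit = center - 3 * sigma

# Points outside the limits signal that the process needs attention.
out_of_control = [
    (index, value)
    for index, value in enumerate(current)
    if value < lower_limit or value > upper_limit
]

print(f"center={center:.2f}, limits=({lower_limit:.2f}, {upper_limit:.2f})")
print(f"out of control: {out_of_control}")
```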

If historical data is unavailable, assume the data is normally distributed.

A control chart tells you when a process or machine is performing outside of its normal limits and shows when to take action. A good example of using control charts with software processes can be found in quantitative software management and the related CMMI processes.

Control charts are used for tracking on a regular basis. Use them to see how things change over time and to spot trends.

## Correlation

Correlation is an exploration of whether different variables are related. Correlation is not causation.

Popular methods of correlation include:

- Pearson coefficient for normal data.
- Spearman coefficient for non-normal data. It is computed on the ranks of the data and answers the question of whether the relationship can be described using a monotonic function.
- Kendall coefficient for non-normal data.
- Regression, which uses an equation to determine if there is a relationship.

Spearman and Pearson are measures of association.
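
The difference between the Pearson and Spearman coefficients can be sketched in plain Python. The helper names `pearson`, `ranks` and `spearman` and the sample data are hypothetical. The Spearman coefficient is computed here as the Pearson coefficient of the ranks, with tied values sharing their average rank.

```python
import math

def pearson(xs, ys):
    """Pearson coefficient: strength of the linear relationship."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman coefficient: Pearson coefficient computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: effort grows monotonically but not linearly with size.
size_kloc = [1, 2, 3, 4, 5]
effort_days = [1, 8, 27, 64, 125]

r_pearson = pearson(size_kloc, effort_days)
r_spearman = spearman(size_kloc, effort_days)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

In this example the Spearman coefficient reaches 1 because the relationship is perfectly monotonic, while the Pearson coefficient stays below 1 because the relationship is not linear.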