May 28, 2020

A Look at Scales of Measurement

  —A closer look at scales of measurement.

In Data Measurement and Analysis for Software Engineers, I discuss the first of four webinars on measurement by Dennis J. Frailey. A key point in the first webinar is the importance of understanding the scales used for measurement and the statistical techniques usable on those scales.

Here, I take a look at scales of measurement and Stanley S. Stevens’ article On the Theory of Scales of Measurement. Stevens was interested in measuring auditory sensation. I’m interested in defect severity. Both measures have some surprising commonality.

Stevens defines measurement as

the assignment of numerals to objects or events according to rules.

Stevens wanted to measure the subjective magnitude of auditory sensation against a scale with the formal properties of other basic scales. Basic scales include those used to measure weight and length.

The rationale for Stevens’ focus on scales of measurement was the recognition that

  • rules can be used to assign numerals to attributes of an object.
  • different rules create different kinds of scales for these attributes.
  • different scales create different kinds of measurement on these attributes.

These differences require definition of

  1. the rules for assigning measurement,
  2. the mathematical properties (or group structure) of the scales and
  3. the statistical operations applicable to each scale.

Rules for Specifying Severity

Defect severity is an example of an ordinal scale. An ordinal scale includes:

  • rules for specifying severity.
  • the properties of a nominal scale for severity, which permit
    • contingency correlation for severity and
    • hypothesis testing of defect severity.
    (A nominal scale has properties that permit the determination of equality.)
  • the properties of an ordinal scale for severity. These properties permit the determination of greater than or less than.

An example where rules are needed is the determination of the severity of a defect. Without rules, severity level reflects the reporter’s intuitive notion of severity. This notion is likely to differ by reporter and possibly by the same reporter over time.

Rules ensure that different reporters consistently apply severity level to defects. Consistently applying severity level ensures that all defects have the same scale of severity. Using the same scale means the mathematical properties of that scale are known. This permits determination of the applicable statistical operations for severity.

The focus on rules is a response to the subjective nature of the measurement. Defect severity is a subjective measure, as shown by the definition of severity (see IEEE Standard Classification for Software Anomalies):

The highest failure impact that the defect could (or did) cause, as determined by (from the perspective of) the organization responsible for software engineering.

This standard defines the following severity levels: Blocking, Critical, Major, Minor and Inconsequential.


Importantly, this definition seeks a scale of measure that reflects a common perspective of severity. This common perspective is created using severity level and shared notions of essential operation and of significant impact.

The Properties of a Nominal Scale for Severity

Severity has the properties of a nominal scale. A nominal scale uses numerals (or labels) to create different classes. Stevens calls this a nominal scale of Type B. A Type B nominal scale supports the mode (frequency), hypothesis testing regarding the distribution of cases amongst different classes and contingency correlation.

A hypothesis test might use severity from all releases to determine if overall quality has improved. To be meaningful, this test would require that the same rules for assigning severity be applied to each release included in the test.

Contingency Correlation for Severity

An example of contingency correlation for defects by release uses two categorical variables: Release and Severity. Release has 2 cases (i.e., R1 and R2); Severity has 5 cases (i.e., Blocking through Inconsequential).

Releases R1 and R2 (the most recent release) have these observed defect counts by severity:1

| Release | Blocking   | Critical   | Major      | Minor      | Inconsequential | Total         |
|---------|------------|------------|------------|------------|-----------------|---------------|
| R1      | 23 (14.4%) | 27 (16.9%) | 41 (25.6%) | 25 (15.6%) | 44 (27.5%)      | 160 (52.1%)   |
| R2      | 44 (29.9%) | 49 (33.3%) | 7 (4.8%)   | 28 (19.0%) | 19 (12.9%)      | 147 (47.9%)   |
| Total   | 67 (21.8%) | 76 (24.8%) | 48 (15.6%) | 53 (17.3%) | 63 (20.5%)      | 307 (100.0%)  |

Observed characteristics:

  • The totals provide the observed frequency of each severity level using the last two releases. That is, the frequency of Blocking severities is \(67\) out of \(307\) observations.
  • The mode severity is Inconsequential for R1, Critical for R2 and Critical for both releases combined.

Releases R1 and R2 have the following expected defect counts by severity:2

| Release | Blocking     | Critical     | Major        | Minor        | Inconsequential | Total         |
|---------|--------------|--------------|--------------|--------------|-----------------|---------------|
| R1      | 34.9 (21.8%) | 39.6 (24.8%) | 25.0 (15.6%) | 27.6 (17.3%) | 32.8 (20.5%)    | 160 (100.0%)  |
| R2      | 32.1 (21.8%) | 36.4 (24.8%) | 23.0 (15.6%) | 25.4 (17.3%) | 30.2 (20.5%)    | 147 (100.0%)  |
| Total   | 67.0 (21.8%) | 76.0 (24.8%) | 48.0 (15.6%) | 53.0 (17.3%) | 63.0 (20.5%)    | 307 (100.0%)  |

Expected characteristics:

  • All expected values are normalized using the frequencies in the total row of the observed table.
  • The expected mode severity is Critical for R1, for R2 and for both releases combined.
  • R1’s observed Blocking, Critical and Minor counts are lower than expected; its observed Major and Inconsequential counts are higher than expected.
  • R2’s observed Major and Inconsequential counts are lower than expected; its observed Blocking, Critical and Minor counts are higher than expected.

Graphically, these look like:

The expected values reflect a proportionality dictated by the frequency of the observations. The next section investigates whether this proportionality has any bearing on the observations.
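
As a check on footnote 2, here is a minimal sketch (Python with NumPy) that should reproduce the expected counts from the observed table; the counts are the ones tabulated above.

import numpy as np

# Observed defect counts by severity for R1 and R2
# (Blocking, Critical, Major, Minor, Inconsequential).
observed = np.array([
    [23, 27, 41, 25, 44],   # R1
    [44, 49,  7, 28, 19],   # R2
])

row_totals = observed.sum(axis=1)    # [160, 147]
col_totals = observed.sum(axis=0)    # [ 67,  76, 48, 53, 63]
table_total = observed.sum()         # 307

# Expected count = (row total x column total) / table total (see footnote 2).
expected = np.outer(row_totals, col_totals) / table_total
print(np.round(expected, 1))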

This figure depicts the same observations for R1 and R2 using a box plot.

A box plot shows the maximum and minimum values (the top and bottom crosses), along with the median value (the line in the middle of the box). The box top and bottom mark the upper and lower quartiles (the top and bottom 25 percent) of the data.

This box plot tells a story of very high Blocking and Critical defects in both releases. It conceals information on the individual releases, for example, has R2 improved?

To understand whether R2 is better than R1, the following graphs are helpful.

Clearly, R2 is worse than R1 in terms of the introduction of higher severity defects, despite the fact that the defect counts are smaller. (An area plot seems more informative than a stacked bar plot, but both provide the same information.)

The following plots place more emphasis on individual severity counts and are less busy than the preceding line chart.

Hypothesis Testing on Severity Distributions

A nominal scale permits hypothesis testing of the distribution of the data. Severity is a form of categorical data. Categorical data is divided into groups (e.g., Blocking and Inconsequential).

There are several tests for the analysis of categorical data. The most appropriate one in this case is based on the Categorical Distribution, which is closely related to a generalization of the binomial distribution called the multinomial distribution.

A binomial distribution describes a random variable that can take on only one of two values (i.e., \(0\) or \(1\)) on each trial; with a single trial (\(n = 1\)) it reduces to a Bernoulli distribution. A multinomial distribution generalizes this so that each trial can have an outcome in any one of several categories.

A Categorical Distribution is the \(n = 1\) case of the multinomial distribution: a vector of length \(k > 0\), where \(k\) is the number of categories, in which exactly one element has the value \(1\) and the others have the value \(0\).

Since a defect can have only one severity level, severity can be viewed as a vector with \(k = 5\) with only one vector element equal to \(1\). (I assume determining severity is a statistically independent event.4)

Statistically, we can use the expected severity proportions to determine the probability of a given distribution of severity counts. In the table above, the expected severity counts follow a multinomial distribution.

The probability that these severity counts are \((67, 76, 48, 53, 63)\) is

\[\begin{align} \begin{array}{ccl} Pr(67, 76, 48, 53, 63) & = & \frac{307!}{67! \times 76! \times 48! \times 53! \times 63!}(0.218^{67} \times 0.248^{76} \times 0.156^{48} \times 0.173^{53} \times 0.205^{63}) \\ & = & 1.5432158498238714 \times 10^{-05} \\ \end{array} \end{align}\]

The probability of this severity count distribution is \(0.000015,\) or \(3\) in \(200000.\)
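
A quick way to sanity-check this figure is scipy’s multinomial distribution. This is a sketch using the rounded proportions quoted above, so the result should come out close to \(1.5 \times 10^{-5}\):

from scipy.stats import multinomial

counts = [67, 76, 48, 53, 63]             # Blocking ... Inconsequential
p = [0.218, 0.248, 0.156, 0.173, 0.205]   # rounded observed proportions

# Probability of exactly this severity count vector in 307 trials.
print(multinomial.pmf(counts, n=sum(counts), p=p))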

There are \(382,313,855\) discrete combinations of severity that sum to \(307.\) (The sum is a constraint placed upon the multinomial distribution, otherwise the probabilities fail to sum to \(1\).) Some statistics on this probability distribution, including counts by severity:

We want to test the null hypothesis \(H_{0}\) of homogeneity. Homogeneity implies that, for each severity category, the probabilities are the same across releases. If the null hypothesis is rejected, then at least two of the probabilities differ.

For this data set, the \(\chi^{2}\) parameters are as follows.

\[\begin{align} \begin{array}{lcl} \chi^{2} & = & 46.65746412772634 \\ \mbox{p-value} & = & 1.797131557103987 \times 10^{-9} \\ \mbox{Degrees of Freedom} & = & 4 \\ \end{array} \end{align}\]

The null hypothesis is rejected whenever \(\chi^{2} \ge \chi^{2}_{\alpha, t - 1}\). The p-value, the probability of observing a result at least this extreme if the null hypothesis were true, is far below \(\alpha = 0.05\), so the null hypothesis is rejected: two or more probabilities differ.
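
These parameters can be reproduced (approximately, allowing for rounding) with scipy’s contingency-table test of homogeneity on the observed R1/R2 counts; a sketch:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [23, 27, 41, 25, 44],   # R1
    [44, 49,  7, 28, 19],   # R2
])

# Chi-square test of homogeneity across the two releases (dof = (2-1)(5-1) = 4).
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.3e}, dof = {dof}")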

The 95% confidence intervals with \(\alpha = 0.05\) and \(Z_{1 - \alpha /2} = 1.96\) are computed as \[ p_{j} \pm Z_{1 - \alpha /2} \times \sqrt{\frac{p_{j} \times (1 - p_{j})}{n}}, \quad \forall \, j, 1 \le j \le k, \] where \(k\) is the number of severity categories and \(n = 307\).

The confidence intervals for each category are:3

\[\begin{align} \begin{array}{ccc} \mbox{Blocking} & = & 0.218 \pm 0.046 \\ \mbox{Critical} & = & 0.248 \pm 0.048 \\ \mbox{Major} & = & 0.156 \pm 0.041 \\ \mbox{Minor} & = & 0.173 \pm 0.042 \\ \mbox{Inconsequential} & = & 0.205 \pm 0.045 \\ \end{array} \end{align}\]

These values provide a confidence interval for each binomial in the multinomial distribution. They are derived using the “Normal Approximation Method” of the Binomial Confidence Interval.
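
A sketch of the normal-approximation intervals above, computed directly from the observed totals:

import numpy as np

labels = ["Blocking", "Critical", "Major", "Minor", "Inconsequential"]
counts = np.array([67, 76, 48, 53, 63])
n = counts.sum()                 # 307
p = counts / n
z = 1.96                         # Z_{1 - alpha/2} for alpha = 0.05

half_width = z * np.sqrt(p * (1 - p) / n)
for label, pj, hw in zip(labels, p, half_width):
    print(f"{label:15s} = {pj:.3f} +/- {hw:.3f}")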

Severity as an Ordinal Scale

Severity is referred to as an ordinal scale. Although it has nominal scale properties, it also supports the properties of an ordinal scale. In effect, the operations permitted on a scale are cumulative: any analysis supported by a nominal scale can be conducted on an ordinal scale.

This cumulative effect of scales is one of the two major contributions made by this paper. The other is the definition of measurement itself. Isn’t that cool?

An ordinal scale permits determination of greater than or less than: a Blocking defect is more severe than a Critical one. This is why we can say that R2 is worse than R1, despite the reduction in reported defects. There are more Blocking and Critical defects in R2 and these, because of the properties of an ordinal scale, are more severe than the others.

I don’t compute the median of severity in these releases because the distance between ordinals is unclear. Better to stick with frequency as it’s unambiguous.

References

The code used to generate the graphs: GitHub.

Footnotes

1. Severity counts randomly generated courtesy of RANDOM.ORG. R1 and R2 were generated by requesting 5 numbers in the range of 1 through 100, inclusive.

2. Each entry is calculated using the formula \(\frac{\mbox{row total} \times \mbox{column total}}{\mbox{table total}}\).

3. To make statistically valid conclusions from the 307 observations, a sample of 301 is required for a confidence interval of 1 and a confidence level of 99%. This implies that this table says something significant regarding the observed severity in both releases R1 and R2.

4. Is a defect severity a statistically independent event? It is because the severity of one defect does not influence the severity of any other defect. By contrast, defects are not independent events. For example, a defect in requirements that remains undetected until implementation will generate a collection of related defects. Is the severity of those related defects the same as the one identified in requirements or different?

May 22, 2020

Data Measurement and Analysis for Software Engineers

  —A look at measurement and its application in software engineering.

In Data Measurement and Analysis for Software Engineers, I go through Part 1 of a talk by Dennis J. Frailey. Part 1 describes scales of measurement and the importance of understanding which statistical operations make sense on those scales. Scales are important because they define the properties of what is being measured.

Part 2 looks at basic analysis techniques for software engineering and what makes a good measurement process. Importantly,

If you only measure the code, you will probably not really understand your software or its development process.

Why? Because many software products are not code. They include specifications, design models, tests, etc. An error in a software product can result in errors in those that depend upon it.

Measurement smells:

  • Requirements are not measurable.
  • Activities are not measured.
  • Software quality attributes are not measured or modeled.
  • Performance of tools and methods employed are not based on factual data.

The countermeasure to these smells is the measurement process. It

  • establishes a measurement program,
  • implements a measurement program and
  • evaluates the measurements.

The International Standard for Software Measurement Processes (ISO/IEC/IEEE 15939) describes a set of criteria for a good measurement process. It does not define a process but tells you what a good process should look like.

A good measurement process requires establishing what your information needs are. Identifying what to measure is the subject of The Goal/Question/Metric (GQM) Paradigm.

The webinar walks through a process for metrics.

  1. Organize the base measures (stuff we measure) by refining or compressing them. This ensures that the data comprising the base measures are consistent.
  2. Compute derived measures from base measures. This ensures that derived measures are directed at fulfilling information needs.

Base measures must be established with consideration of their scales of measure. This determines which statistical calculations are meaningful as part of the derived measures.

The use of probability distributions and statistics is the subject of the third webinar. This includes discrete and continuous functions.

April 29, 2020

Data Measurement and Analysis for Software Engineers

  —A look at measurement and its application in software engineering.

This post is motivated by a series of SIGSOFT webinars by Dennis J. Frailey entitled Fundamentals of Measurement and Data Analysis for Software Engineers. Part I lays out the foundation of measurement theory for software engineering. The first webinar includes a field chart for scales and permitted operations. It’s worth the price of admission.

The ultimate goal for software metrics is to help software professionals … make decisions under uncertainty.

Why do we measure? To make things more visible and controllable. Better information leads to more informed decision making. The key is proper selection, collection, analysis and interpretation of metrics.

In discussing analysis and interpretation, Frailey references a paper by Stanley S. Stevens entitled On the Theory of Scales of Measurement. This paper weighs in at 3 pages and tells a story of the challenges in creating a definition of measurement. Stevens arrived at

measurement is the assignment of numerals to objects or events according to rules.

The paper sets out to make explicit

  • the rules for assigning numerals,
  • the mathematical properties of the resulting scales and
  • the statistical operations permitted on each scale.

Misunderstanding scales and the statistical analysis on measures in those scales leads to poor decisions.

Frailey refines the definition of measurement using Fenton’s definition:

Measurement is
… the process by which
numbers or symbols
are assigned to
attributes of entities in the real world
in such a way so as to
describe them according to clearly defined rules.

Importantly,

The assignment of numbers must preserve the intuitive and empirical observations about the attributes and entities.

Entities, Attributes and Values of Attributes

Attributes are features, properties or characteristics that allow us to distinguish entities from each other or compare and evaluate entities.

Attributes are often measured by numbers or symbols–numbers don’t always make sense. They permit distinguishing entities or comparing them. An attribute is a property or characteristic of an entity that can distinguish them quantitatively or qualitatively.

A defect identifier is best represented as a unique symbol (e.g., BUG-1). An estimate and time spent on a defect is represented as a unit of time; severity is a category (e.g., High (H) and Low (L)).

Basic Rules and Principles

Measures should lead to useful information–you should have a purpose for every measure. Alternatively, don’t collect measures if you don’t know how they will be consumed. It’s pointless and distracting.

A consistent set of rules can indicate the type of measurement results. In most cases, the formulation of the rules of assignment determines the nature of the scale. If there is any ambiguity, consider the group formed by the scale: in what ways can it be transformed and still serve its function? Measurement is never better than the empirical operations by which it is carried out.

If our product backlog comprises stories and defects it is meaningful to count them within their respective categories. It is meaningful to count the defects attributed to a story.

Comparing stories to defects is meaningless unless a comparison can be made with time spent or estimates. Since stories lack a severity attribute they are different from defects. This difference prevents direct comparison.

Good measures rely upon a model that reflects a specific viewpoint.

  • What we are going to measure?
  • What attribute are we collecting data on?

Identifying what to measure is the subject of The Goal/Question/Metric (GQM) Paradigm. It is also the realm of the scientific method.

A measure is a variable to which a value is assigned as a result of measurement. Data is a collection of values assigned to measures. Identifying how to measure something lies in the realm of measurement theory.

Measurement theory differentiates between base and derived or calculated measures.

  • A base measure or direct measure is a direct quantification of an attribute. It is generated through observation of the thing being measured (e.g., time spent on a story or defect).

  • A calculated or derived measure is a measure defined in terms of other measures, typically base measures (e.g., the average number of hours spent on defects). A small sketch follows this list.
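
A minimal sketch of the distinction, with hypothetical time-spent figures:

# Base measure: hours spent per defect, recorded by direct observation.
# (Hypothetical figures for illustration.)
hours_per_defect = [4.0, 1.5, 6.0, 2.5]

# Derived measure: defined in terms of the base measure.
average_hours_per_defect = sum(hours_per_defect) / len(hours_per_defect)
print(average_hours_per_defect)   # 3.5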

Measurement involves the assignment of numbers to attributes of the entities being observed. That assignment determines how the entity is classified and what statistical operations have meaning on that object. The analysis of measures requires understanding the scales of measurement.

Scales of Measurement

A scale of measure is also known as a level. They help us understand what we are seeing.

A scale is a collection of symbols or numbers used to classify attributes of entities. It is a system for describing the nature of information.

Different scales have different properties. Those properties determine which forms of analysis are appropriate on them.

Stevens classifies scales into the following groups.

  • nominal
  • ordinal
  • interval
  • ratio

The further down this list, the more sophisticated the analysis you can perform.

Nominal Scale

A nominal scale enables categorization but does not support ordering. In these scales, numbers and symbols are used only as labels. For example, colour and a football jersey number used to identify a player are both nominal scales. (Colour has a natural order if you look at wavelength, but let’s ignore this for the purposes of this example.)

You can count the size or frequency (mode) of each category. There is no average or median, nor is there a natural category ordering.

In a nominal scale you can easily count stories and defects and identify individual objects using the identifier. There is no way to determine which story should be ordered before another.

You can use numbers to categorize a nominal scale, but there is no numerical meaning. Changing the shape assigned to stories does not change the scale. Likewise, changing the label “Story” does not change the scale: there are still three objects in the category.

In a nominal scale, do not assign the same numeral or symbol to different categories or different numerals and symbols to the same category. For example, Story and Defect are assigned different shapes to differentiate them. If they had been assigned the same shape, they would be indistinguishable from each other.
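
A small sketch of what a nominal scale does and does not support, using hypothetical backlog labels:

from collections import Counter

# Backlog items labelled by category: a nominal scale (hypothetical data).
items = ["Story", "Defect", "Story", "Story", "Defect"]

counts = Counter(items)          # counting per category is meaningful
print(counts)                    # Counter({'Story': 3, 'Defect': 2})
print(counts.most_common(1))     # the mode: [('Story', 3)]

# Relabelling "Story" as 1 and "Defect" as 2 would not give the numbers any
# arithmetic meaning: an "average category" of 1.4 says nothing.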

Ordinal Scale

Ordinal scales enable ranking or ordering the data. For example, the severity of defects is often ordered as high, medium and low, so defects can be ranked by severity.

Items in an ordinal scale can be sorted. This permits the middle item to be identified. Ordinal items do not support the notion of average (and hence standard deviation). There is no mathematical relationship between ordinal values–the degree of distance between categories is not defined.

Stevens says that averages and standard deviations computed on ordinals are in error to the extent that successive intervals on the scale are unequal. Percentages are dangerous if interpolating linearity within a class interval. Likewise, interpolating the mid-point in a class interval using linear interpolation is dangerous because linearity in an ordinal scale is an open question.
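
A sketch of the same idea for an ordinal scale, again with hypothetical severities:

# Defect severities on an ordinal scale: order is defined, distance is not.
rank = {"low": 0, "medium": 1, "high": 2}
severities = ["high", "low", "medium", "high", "low"]

ordered = sorted(severities, key=rank.get)   # ranking is meaningful
median = ordered[len(ordered) // 2]          # so is the median
print(ordered, median)   # ['low', 'low', 'medium', 'high', 'high'] medium

# A mean of the ranks would assume equal spacing between low, medium and high,
# which an ordinal scale does not define.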

Interval Scale

Interval scales are ordered and there is a fixed distance between consecutive members. Examples include dates, times and temperatures.

Given any two dates (times or temperatures), you can count the number of intervals between any two points on the scale. You can add or subtract values, order them and calculate the distance between any two of them. You cannot multiply or divide or calculate a ratio. There is no true zero (e.g., when time began).

Computing ratios of interval-scale values is a common error: the ratio of the start times of two stories, or of a story’s start and end times, is meaningless. You can, however, compute ratios between differences. It is fine to say the time spent on one story is double or half that spent on another.

The zero point on an interval scale is usually a matter of convenience (e.g., Centigrade and Fahrenheit temperatures in comparison with Kelvin).
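
A sketch contrasting what an interval scale permits with what it does not, using dates (and, for contrast, time spent, which is a ratio-scale quantity); the dates and hours are hypothetical:

from datetime import date

# Dates sit on an interval scale: differences are meaningful, ratios are not.
start_r1 = date(2020, 1, 6)
start_r2 = date(2020, 3, 2)
print((start_r2 - start_r1).days)   # 56: the interval between the two dates

# A ratio of the two dates is meaningless (and not even defined for dates):
# the zero point of the calendar is a matter of convention.

# Time spent, by contrast, has a true zero, so ratios work.
hours_spent = {"STORY-1": 8.0, "STORY-2": 4.0}
print(hours_spent["STORY-1"] / hours_spent["STORY-2"])   # 2.0: double the time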

Ratio Scale

Ratio scales support multiplication and division. They support equality, rank-ordering, equality of intervals and equality of ratios. There is a true zero. They are commonly encountered in physics.

An example of a ratio scale is speed.

Ratio scales are either fundamental or derived. A fundamental scale measures a true value (e.g., speed). A derived scale is a function of one or more values from a fundamental scale (e.g., acceleration). All types of statistical measures are applicable: you can compute an average or mean (and thus a standard deviation).

Absolute Scale

An absolute scale is a scale where all mathematical operations (e.g., average and exponentiation) are meaningful. It is a system of measurement that begins at a minimum, or zero point, and progresses in only one direction.

Pressure, length, area and volume are all measured using an absolute scale.

Sample Size

Suppose you have a large number of entities. How many are needed to make predictions about them? Sample size is the number of entities to measure to ensure meaningful predictions.

To determine the sample size, the total number of entities needs to be understood. The total number of entities is the population.

To make meaningful predictions, the proportion of the sample size relative to the population count is needed. Sampling data includes a margin of error and a sample size. Sample size depends upon the circumstances: what is the size of your sample as a percentage of the total, and how well have you selected your sample?

A sample size of 1% is less useful than one representing 10% or 20% of the population. Consider what proportion of the population is included in the sample: the larger the sample size, the better.
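
A sketch of one common way to turn this into a number: the normal-approximation sample-size formula for estimating a proportion, with a finite population correction. The margin of error, confidence level and \(p = 0.5\) (the most conservative choice) are assumptions for illustration:

import math

def sample_size(population, margin_of_error=0.05, z=1.96, p=0.5):
    """Sample size for estimating a proportion (normal approximation),
    corrected for a finite population."""
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

for population in (307, 10_000, 1_000_000):
    print(population, sample_size(population))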

Resources

Some of these resources make me wonder if some of the learning in psychology on creating measures is applicable to software metrics. Not thinking Likert but more about methodology and identification.

This thinking contradicts a little of what On the Application of Measurement Theory in Software Engineering says, but not entirely. An interesting question when posed in the context of some discussion in Statistics and the Theory of Measurement and Measurement Theory with Applications to Decisionmaking, Utility, and the Social Sciences.

Suppes thought through measurement on many different levels. Methodology, Probability and Measurement. A great resource but a hard read.

April 23, 2020

Another Look at Base Rate Fallacy for Shared Resources

  —Logical fallacy and broken models.

Here’s another look at the base rate fallacy I discussed in Logical Fallacy and Memory Allocation, this time applied to a shared resource. I’ll use memory allocation again.

This is a contrived example. I am not suggesting this analysis is correct. In fact, I’m trying to show the opposite (the fallacy of this approach).

Make sense?

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = dict()

Let \(P(E = n)\) be the probability of the event occurring.

We know that \(P(E = 1) = 1\), because this user initiated event must occur at least once in order to use the system.

Assume most users will have \(n > 1\). Assume that \(P(E = 2) = \frac{1}{2}\), \(P(E = 3) = \frac{1}{3}\) and \(P(E = n) = \frac{1}{n}\), for \(1 \le n \le 300\).

The probability distribution function can be any function showing the probability of an event decreasing as the number of events increases.

def event_pdf(n):
    """Calculate the probability of the event occuring."""
    return 1 / n

df['P(E = n)'] = [ event_pdf(n) for n in range(1, 301) ]
for n in [ 0, 1, 2, 299]:
    print("P(E = {:3}) = {:1.3f}".format(n + 1, df['P(E = n)'][n]))
P(E =   1) = 1.000
P(E =   2) = 0.500
P(E =   3) = 0.333
P(E = 300) = 0.003

Let \(P(C)\) be the probability of system memory being exhausted.

Then \(P(C \vert E = n) = \frac{P(C \cap (E = n))}{P(E = n)}\).

Assume \(P(C \vert E = n) = 0\) for all \(1 \le n < 300\) and \(P(C \vert E = 300) = 1\). That is, memory exhaustion occurs only when the \(300^{th}\) event occurs.

(The careful reader will note the flaw here. While it is true that the system crashes upon the \(300^{th}\) event, it is wrong to consider this in isolation. But let’s continue our argument with this flaw.)

def crash_given_event(n):
    """Calculate the probability of P(C|E = n)."""
    return 0 if n < 299 else 1

df['P(C|E = n)'] = [ crash_given_event(n) for n in range(0, 300) ]

We know \(P(C \vert E = 1) = 0\) and \(P(C \vert E = 300) = 1\).

for n in [0, 1, 2, 299]:
    print("P(C|E = {:3}) = {:1.3f}".format(n, df['P(C|E = n)'][n]))
P(C|E =   0) = 0.000
P(C|E =   1) = 0.000
P(C|E =   2) = 0.000
P(C|E = 299) = 1.000

Let \(M(E = n)\) be the proportion of memory consumed at the completion of the \(n^{th}\) event.

We know that \(M(E = 300) = 1\) and \(M(E = 1) = 0.75\). We know each event consumes an additional \(0.0875\)% of memory.

def memory(n):
    """Determine percentage of consumed memory."""
    if 0 == n:
        return 0.75
    elif 299 == n:
        return 1
    else:
        return 0.75 + (n * 0.000875)
    
df['Memory (% Used)'] = [ memory(n) for n in range(0, 300) ]

We know that the system exhausts available memory immediately when the \(300^{th}\) event occurs.

fig,ax = plt.subplots()

for name in ['P(C|E = n)','P(E = n)', 'Memory (% Used)']:
    ax.plot(df[name],label=name)
    
ax.set_ylabel("probability")
ax.set_xlabel("event number")
ax.set_title('system outage')
ax.legend(loc='right')

The model \(P(C \vert E = n)\).

The system outage graph above is our model of the memory leak if we reason about the leak in isolation.

We can improve this model significantly and still make the same error.

def crash1_given_event(n):
    """Calculate the probability of P(C1|E = n)."""
    return 1 / (300 - n)

df['P(C1|E = n)'] = [ crash1_given_event(n) for n in range(0, 300) ]
fig,ax = plt.subplots()

for name in ['P(C1|E = n)','P(E = n)', 'Memory (% Used)']:
    ax.plot(df[name],label=name)
    
ax.set_ylabel("probability")
ax.set_xlabel("event number")
ax.set_title('system outage')
ax.legend(loc='best')

The model \(P(C1|E = n)\) considers the increase in probability of a crash as the number of events increases. Better, but flawed. Flawed because memory is a shared resource.

The model should look something like \(P(C1 \vert E = n, F, G, \ldots)\), where \(F\) and \(G\) are other events affecting memory.

These models are positioned as probability distributions of different events in memory. They make for nice discussion points and reflect the fact that the underlying arguments are probability based.

In my opinion, a better approach is to count the bytes allocated and freed and map this over time and across different use cases.
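
A minimal sketch of that counting approach: track bytes allocated and freed per use case over time and look for balances that never return to zero. The log entries are hypothetical.

# (use case, operation, bytes) records, e.g. emitted by an instrumented allocator.
allocation_log = [
    ("login",    "alloc", 4096),
    ("login",    "free",  4096),
    ("checkout", "alloc", 8192),
    ("checkout", "alloc", 1024),
    ("checkout", "free",  8192),
]

in_use = {}
for use_case, op, nbytes in allocation_log:
    delta = nbytes if op == "alloc" else -nbytes
    in_use[use_case] = in_use.get(use_case, 0) + delta

# A use case whose balance stays positive across runs is a leak candidate.
print(in_use)   # {'login': 0, 'checkout': 1024}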

March 31, 2020

The Goal/Question/Metric (GQM) Paradigm

  —A look at a framework for creating well-aligned software metrics.

In McCall’s Software Quality Model, I discussed a paper tying quality factors to quality criteria. That model takes quality criteria coupled with metrics collected throughout the lifecycle to provide feedback on quality. What’s missing from this model is a discussion on how to arrive at good metrics.

The Goal/Question/Metric (GQM) Paradigm, published in 1994, provides a framework for arriving at good measures.

Measurement is a mechanism for creating a corporate memory and an aid in answering a variety of questions associated with the enactment of any software process.

Measurement also helps, during the course of a project, to assess its progress, to take corrective action based on this assessment, and to evaluate the impact of such action.

I like this framework when it’s coupled with the quality criteria in McCall’s model. What follows is a brief description of GQM and how I think it complements McCall’s model.

GQM must be applied top-down and focus on goals and models. An organization must identify its goals and trace those goals to the data intended to support them operationally. It must then provide a framework for interpreting the data with respect to those goals.

An object is the focus of measurement. Objects can be products, processes or resources.

  • Goals are defined for an object that is to be the focus of measurement. A goal is characterized by a purpose, issue and viewpoint (or perspective). An object is a product (something that is produced), process (an activity) or resource (something consumed).

  • Questions connect the object of measurement to a quality issue. They determine the quality from the viewpoint.

  • Metrics are associated with every question. Metrics are objective if they are the same regardless of viewpoint and are subjective if they depend upon the viewpoint.

A GQM model is developed by identifying a set of quality or productivity goals. Questions are derived for object of measurement to define the goal as completely as possible. Metrics are developed to answer those questions.

Goals are developed from policy and strategy, from descriptions of processes and products, and from the viewpoint used to develop the measurement.
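
To make this concrete, here is a small, hypothetical GQM model written as data; the goal coordinates follow Basili’s purpose/issue/object/viewpoint template, and the content echoes the defect-severity example from the other posts:

gqm_model = {
    "goal": {
        "purpose":   "improve",
        "issue":     "the severity profile of",
        "object":    "released software (a product)",
        "viewpoint": "the organization responsible for software engineering",
    },
    "questions": [
        "Is the distribution of defect severity improving release over release?",
    ],
    "metrics": [
        "defect counts by severity level per release",   # objective
        "reporter-assigned severity level",              # subjective
    ],
}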

Quality factors in McCall’s model affect product operation (fulfills specification), translation (ability to adapt software) or revision (ability to change software). These are similar to architectural quality attributes–a measurable or testable property of a system used to indicate how well the system satisfies the needs of its stakeholders. Basili’s GQM cites McCall’s model and identifies it as another means of defining Software Quality Metrics.

The chief contribution of GQM over McCall’s model is the explicit introduction of goal coordinates based upon viewpoint, purpose, issue and object. The explicitness of the goal coordinates creates a wider perspective for goals.

I like the complement between architectural quality attributes, McCall’s quality model, where criteria is identified to measure attributes, and the GQM idea of tying viewpoint to metric.

The Goal/Question/Metric Paradigm overlaps and refines McCall’s Software Quality Model as follows.

Here, goals create quality factors, an extension of the original concept to explicitly include a wider variety of objects.

I don’t differentiate between the current notion of software architecture quality attributes and quality factors. I see GQM’s notion of goal as a superset because a goal might include resource, time, defects, etc. I view quality attributes as non-behavioural requirements.

Questions, aided by goal coordinates, motivate quality criteria that connect goals to metrics.

Metrics are an extension over quality measures because they explicitly include subjective and objective measures. Not necessarily missing from McCall’s model but not explicitly called out either.