September 27, 2021

Graphical Methods for Statistical Analysis (Part II)

In Graphical Methods, I described a problem I was having with an implementation. Basically, I was struggling to find what felt like the right abstraction for a collection of functions that all looked like this:

@gm.command()
@click.argument('column')
@click.pass_context
def box_plot(ctx, column):

    fig, ax = plt.subplots(1, 1, figsize = FIGURE_SIZE, tight_layout = True)

    ax.set_title('Box Plot')
    ax.set_xlabel(column)
    ax.set_ylabel('Data Set')

    ax.boxplot(ctx.obj['data frame'][column], vert=False)

    doWritePngFile(ctx.obj['output file'] + '_' + 'box_plot', doSaveFigure(fig))

I ended up with the following:

@gm.command()
@click.argument('column')
@click.argument('output', type=click.File('wb'))
@click.option('--format', type=click.Choice([ 'png', ], case_sensitive = False), default = 'png')
@click.pass_context
def box_plot(ctx, column, output, format):
    """Create a box plot from a COLUMN of data in the CSV-FILE.

    COLUMN is the name of a single column in the CSV-FILE. This column
    is used in the univariate plot.

    OUTPUT is the name of the output file.

    FORMAT is the type of output format.
    """

    plot = BoxPlot(figsize = FIGURE_SIZE, tight_layout = True)
    plot.plot(ctx.obj['data frame'], column)
    plot.save(output, format)

This still follows the setup, plot, and write pattern that I liked from the last post, but it introduces BoxPlot to manage the box plot's lifecycle. The BoxPlot class looks like this:

import matplotlib.pyplot as plt

class BoxPlot(object):
    def __init__(self, **kwargs):
        """Construct the box plot object."""
        self._figure, self._axis = plt.subplots(**kwargs)

    def __del__(self):
        """Destroy the figure once it's finished being used."""
        plt.close(self._figure)

    def plot(self, data_set, column_name):
        """Create a univariate box plot of the data set's column."""
        self._axis.set_title('Box Plot')
        self._axis.set_ylabel('Data Set')
        self._axis.set_xlabel(column_name)
        self._axis.boxplot(data_set[column_name], vert=False)

    def save(self, file_pointer, file_format: str):
        """Write the box plot to the specified file or file-like object."""
        self._figure.savefig(file_pointer, format = file_format)

What works for me here is that the box plot has become the focus of the domain, not the Matplotlib figure and axis. Identifying the domain is the same challenge I described in Better Class Design. That’s the hard part of getting this correct–finding the right concept in the domain.

I still have a lot of duplicate code–the __init__(), __del__() and save() methods are the same for all of my plot classes. The next refactor will clean that up. Every function like box_plot() and every class like BoxPlot follows the same pattern. This tells me there are still a couple of abstractions missing from the implementation, but that I'm on the right track.
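A sketch of where that refactor might go: pull the shared lifecycle into a common base class and leave only the plotting behaviour in the subclasses. The FigurePlot name is a placeholder I'm using for illustration, not code that exists yet.

import matplotlib.pyplot as plt

class FigurePlot(object):
    """Sketch of a base class that owns the shared figure lifecycle."""

    def __init__(self, **kwargs):
        self._figure, self._axis = plt.subplots(**kwargs)

    def __del__(self):
        """Destroy the figure once it's finished being used."""
        plt.close(self._figure)

    def save(self, file_pointer, file_format: str):
        """Write the plot to the specified file or file-like object."""
        self._figure.savefig(file_pointer, format = file_format)

class BoxPlot(FigurePlot):
    def plot(self, data_set, column_name):
        """Only the box-plot-specific behaviour remains in the subclass."""
        self._axis.set_title('Box Plot')
        self._axis.set_ylabel('Data Set')
        self._axis.set_xlabel(column_name)
        self._axis.boxplot(data_set[column_name], vert=False)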

Where this last refactor has paid off is in the elimination of doWritePngFile(). There was an excessive amount of layering there that was overly focused on writing to just a file. The save() API is much more general, taking only a file-like object and a format string supported by Matplotlib's savefig(). This was achieved at the expense of pushing the format parameter through to the command line.
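To illustrate that generality, save() is happy with anything file-like, not just a path on disk. A quick sketch, using a made-up data frame, column name and figure size:

import io

import pandas as pd

# Hypothetical data, purely for illustration.
data_frame = pd.DataFrame({'latency': [12.0, 15.5, 11.2, 18.9, 14.1]})

plot = BoxPlot(figsize = (8, 6), tight_layout = True)
plot.plot(data_frame, 'latency')

# A file opened in binary mode works...
with open('latency_box_plot.png', 'wb') as output:
    plot.save(output, 'png')

# ...and so does an in-memory buffer.
buffer = io.BytesIO()
plot.save(buffer, 'png')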

Another important point is that this refactor moved me away from the recurring problem of the Data Class. That's one more indication that this last refactor was a positive step forward.

August 29, 2021

Graphical Methods for Statistical Analysis

I recently read Graphical Methods in Statistical Analysis by Lincoln E. Moses. A nice introduction that takes you through several different ways to present data.

I wrote a program that takes a look at some of the univariate methods described in the paper. In my case, I am interested in plotting a single variable from a data set–so the paper’s description of multiple samples isn’t interesting.

What came out of the development of this tool was a collection of implementation and design challenges. Basically, I wrote code and didn't like the result. I explored the problem space a little and still didn't like the implementation.

I settled on the following implementations.

@gm.command()
@click.argument('column')
@click.pass_context
def histogram(ctx, column):
    fig, ax1 = plt.subplots(1, 1, figsize = FIGURE_SIZE, tight_layout = True)

    ax1.set_title('Histogram')
    ax1.set_ylabel('Frequency')
    ax1.set_xlabel(column)

    ax1.hist(ctx.obj['data frame'][column], bins = calculateBins(ctx.obj['data frame'][column]), color='k', ls='solid')

    ax2 = ax1.twinx()
    ax2.set_ylabel('Relative Frequency')
    ax2.tick_params(axis='y')

    adjust_y_axis = lambda x, pos: "{:0.4f}".format((x / len(ctx.obj['data frame'][column])))
    ax2.hist(ctx.obj['data frame'][column], bins = calculateBins(ctx.obj['data frame'][column]), color='k', ls='solid')
    ax2.yaxis.set_major_formatter(tick.FuncFormatter(adjust_y_axis))

    doWritePngFile(ctx.obj['output file'] + '_' + 'histogram', doSaveFigure(fig))

and

@gm.command()
@click.argument('column')
@click.pass_context
def box_plot(ctx, column):

    fig, ax = plt.subplots(1, 1, figsize = FIGURE_SIZE, tight_layout = True)

    ax.set_title('Box Plot')
    ax.set_xlabel(column)
    ax.set_ylabel('Data Set')

    ax.boxplot(ctx.obj['data frame'][column], vert=False)

    doWritePngFile(ctx.obj['output file'] + '_' + 'box_plot', doSaveFigure(fig))

The implementation relies upon the Click and Matplotlib (https://matplotlib.org) Python packages.
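One note on the code above: the histogram command calls a calculateBins() helper that I haven't shown. A minimal sketch of what such a helper might look like, assuming something simple like Sturges' rule for the bin count:

import math

def calculateBins(column):
    """Hypothetical bin-count helper based on Sturges' rule."""
    count = len(column)
    if count < 2:
        return 1
    return math.ceil(math.log2(count)) + 1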

What I like about the box_plot() and histogram() functions is their form:

  • setup the figure and axes
  • plot the data
  • write the data

This pattern of setup, plot, write repeats itself in every method that generates a plot.

What I don’t like is the code duplication in these functions. Setup and write differ only by their parameters. Only plot is unique.

The challenge is to find a way to preserve the pattern and eliminate the duplication. I want to improve the code, not complicate it, and enhance its readability.

The obvious change is:

def createFigure():
    return plt.subplots(1, 1, figsize = FIGURE_SIZE, tight_layout = True)

def setFigureAttributes(ax, title, x_label, y_label):
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)

This eliminates the duplication. It's the same solution I came up with for the doWritePngFile() function. Still, I don't like it.

I don't like it because `createFigure()` assumes there is only ever one subplot.
If I rewrite this to:

def createFigure(row, column):
    return plt.subplots(row, column, figsize = FIGURE_SIZE, tight_layout = True)

Then basically I'm just writing a wrapper around plt.subplots(), which is the same thing I did with setFigureAttributes().

The problem in each case is that while I’m eliminating duplication I’m increasing cognitive load. In order to understand the wrapper, I need to understand what the wrapper calls. No value in this.

A refactor:

class Plot(object):
    def __init__(self, title, x_label, y_label):
        self._title = title
        ...

    def save(self, fig, file_name):
        doWritePngFile(file_name, doSaveFigure(fig))

class BoxPlot(Plot):
    def __init__(self, title, x_label, y_label):
        super().__init__(title, x_label, y_label)
        self._figure, self._ax = plt.subplots(...)

    def plot(self, data):
        self._ax.boxplot(data, vert=False)

    def save(self, file_name):
        super().save(self._figure, file_name)

This isn't any better because the Plot class contains a bunch of data that it doesn't use. The BoxPlot class would need that data to set up the axes. Dumping this data back into BoxPlot re-introduces the repetition of setup between different plots. If I do that, all I have left is a Plot class that has an implementation for save() and nothing else. This approach is the wrong abstraction.

Another refactor:

class Plot(object):
    def __init__(self, figure, axes, formatter):
        self._figure = figure
        self._axes = axes
        self._formatter = formatter

    def plot(self):
        pass

    def save(self, name):
        self._formatter(name)

class BoxPlot(Plot):
    def __init__(self, formatter):
        super().__init__(*plt.subplots(...), formatter)

Still no good. You can't plot without knowing the type of plot, and it seems silly for BoxPlot to forward a function to Plot just so that it gets used there. Wrong abstraction again, although the formatter does provide some insight into how to save the plot in different formats.
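The kernel of that insight is that the caller chooses how the figure gets written. A tiny sketch of the idea on its own, outside any class hierarchy (the function names here are hypothetical):

import matplotlib.pyplot as plt

def save_as_png(figure, name):
    """Hypothetical formatter that writes the figure as a PNG file."""
    figure.savefig(name + '.png', format = 'png')

def save_as_svg(figure, name):
    """Hypothetical formatter that writes the figure as an SVG file."""
    figure.savefig(name + '.svg', format = 'svg')

# The plotting code stays the same; only the injected formatter changes.
figure, axis = plt.subplots()
axis.boxplot([1.0, 2.0, 3.5, 2.2], vert=False)
save_as_svg(figure, 'box_plot')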

Anyway, where I ended up is that I don't know enough about the abstraction I'm looking for to do anything meaningful. I'm going to leave the duplication and see how it turns out.

Part of the argument for leaving this duplication is that I'm not sure whether the program will need multiple subplots, or how it would use them. I don't want to trick myself into building an abstraction that gets in the way of future enhancements. This is a different look at Practical Application of DRY.

July 2, 2021

Developer Traps: Needless Layering

  —A detailed analysis of how I code provided insight on poor practice.

I started to perform a deeper analysis of my approach to coding. It’s proven insightful.

My approach to studying how I code was motivated by a talk by Kent Beck on Responsive Design. This talk is insightful overall, but what motivated my analysis was Kent’s discussion on how he prepared for his Smalltalk patterns book.

To develop the patterns in the book he spent time writing down what he was doing when he coded and providing explanations for why he was doing those things. Basically, he was trying to capture the thought process of an expert Smalltalk developer in a way that could be easily explained to others.

What struck me in this discussion was that anyone could repeat this exercise and benefit from it. So I wrote down what I did when I coded, with the objective of eliminating poor practice.

Here is an example.

I’m in the process of writing an application using Qt that relies upon an SQLite database. One of the challenges was that I wanted to create test cases that automatically create and remove database files.

Seems simple enough. My first attempt looks like this. (I’m using Qt and Google Test.)

class FileTemplateManager
{
public:
    FileTemplateManager(const char * const t): file_name_template(t) {
        if(!file_name_template)
           throw "file template error";
    }

    ~FileTemplateManager() {
    }

    const char *get_template() const {
        return file_name_template;
    }

private:
    const char * const file_name_template;
};

const bool fileExists(const QString& path) {
    QFileInfo check_file(path);
    if (check_file.exists() && check_file.isFile()) {
        return true;
    }
    return false;
}

/* Check that creating the database produces a file on disk.
 */
TEST(TestDatabase, Create) {
    FileTemplateManager test_file_path("/tmp/database.XXXXXX");
    QString database(test_file_path.get_template());

    create(database);
    ASSERT_TRUE(fileExists(database));
}

The point of this attempt was to ensure that I could create a database with a fixed path. That database is located at /tmp/database.XXXXXX. The format of the database path is motivated by my desire to use man 3 mktemp to generate a unique file name. I'm planning on using mktemp because I couldn't find a generic way to generate a random file name that I could provide to SQLite to create the database. (Qt has a function for this, but the name isn't generated until you open the file.)

The entire focus of the abstraction around temporary files is the management of the template string. Initially, I liked this code because it worked, it was clean and the FileTemplateManager did only one thing.

The second attempt looks like this.

class FileTemplateManager
{
public:
    FileTemplateManager(const char * const t): file_name_template(t) {
        if(!file_name_template)
           throw "file template error";
    }

    ~FileTemplateManager() {
    }

    const char *get_template() const {
        return file_name_template;
    }

private:
    const char * const file_name_template;
};

class FileManager
{
public:
    FileManager(const FileTemplateManager& t): file_name_template(t) {
        file_name = new char[strlen(file_name_template.get_template()) + 1];
        strcpy(file_name, file_name_template.get_template());
        mktemp(file_name);
    }

    const char * const get_name() const {
        return file_name;
    }

    ~FileManager() {
        if(std::remove(file_name) != 0)
            std::perror( "Error deleting file" );
        delete[] file_name;
    }

private:
    FileTemplateManager file_name_template;
    char *file_name;
};

const bool fileExists(const QString& path) {
    QFileInfo check_file(path);
    if (check_file.exists() && check_file.isFile()) {
        return true;
    }
    return false;
}

/* Check that creating the database produces a file on disk.
 */
TEST(TestDatabase, Create) {
    FileManager test_file_path(FileTemplateManager("/tmp/database.XXXXXX"));
    QString database(test_file_path.get_name());

    create(database);
    ASSERT_TRUE(fileExists(database));
}

I introduced the FileManager abstraction to manage the creation and deletion of the file using RAII. There is some nice stuff here: FileManager is blissfully ignorant of the internal details of FileTemplateManager and the two play well together. mktemp gets introduced and the whole thing works nicely.

The third attempt looks like this.

class FileManager
{
public:
    FileManager(const char * const t) {
        file_name = new char[strlen(t) + 1];
        strcpy(file_name, t);
        if(0 == strlen(mktemp(file_name))) {
            throw "cannot find a unique file name";
        }
    }

    const char * const get_name() const {
        return file_name;
    }

    ~FileManager() {
        if(std::remove(file_name) != 0)
            std::perror( "Error deleting file" );
        delete[] file_name;
    }

private:
    char *file_name;
};

const bool fileExists(const QString& path) {
    QFileInfo check_file(path);
    if (check_file.exists() && check_file.isFile()) {
        return true;
    }
    return false;
}

/* Check that creating the database produces a file on disk.
 */
TEST(TestDatabase, Create) {
    FileManager test_file_path("/tmp/database.XXXXXX");
    QString database(test_file_path.get_name());

    create(database);
    ASSERT_TRUE(fileExists(database));
}

FileTemplateManager turned out to be complete nonsense. It's not needed because the string it managed could only be corrupt or invalid if there was a programmer error. Guarding against that at runtime was a waste.

FileTemplateManager is also the wrong abstraction. The goal was a unique file name to use in the test, and FileTemplateManager did nothing to support that.

Eliminating FileTemplateManager is best because the errors that mktemp returns occur when there are no 'X's in the file name and when it can't create a unique file name.

The first is a programming error and the second is a runtime error. The programming error is handled by ignoring the NULL return from mktemp whenever there are no X’s in the file name. This will be caught when you send NULL to strlen().

The second requires a check for a zero-length (empty) string. Whenever this occurs, FileManager cannot fulfill its contract, so it throws an exception.

Nice.

A better path to this implementation would have used FileManager to return the provided file name in the first iteration and to deal with mktemp in the second. A closer look at the nature of the errors returned by mktemp, and what they imply in the context where mktemp would be used, might have paved the way for this.

June 3, 2021

Pareto Charts and Retrospective

  —A look at quantifying your retrospective data.

I've been engaged in retrospectives for multiple teams and have always run them with the same principles:

  • create an environment where people are free to discuss the good and poor outcomes.
  • review our current best practice and revise if we can identify mitigations to prevent poor outcomes.
  • drive towards an action or decision on each topic discussed.

The most recent set of changes to our retrospective is discussed in Changing Defect Management to include Root Cause. Recently, I was told that I'm doing retrospective wrong.

The argument was that we shouldn't be fixing anything without a Pareto chart of the issues. I discuss Pareto charts in several other articles.

This article explores the relationship between the Pareto chart and a retrospective. It also considers the value of small wins and, frankly, makes an argument that the current structure of the retrospective is better.

The chief characteristic of a retrospective is that the team reflects on their performance over the last sprint and seeks to identify things they should

  • start,
  • stop and
  • continue doing.

A retrospective includes consideration of

  • processes,
  • people,
  • relationships and
  • tools.

A retrospective actions items immediately so corrections can begin during the next sprint.

A Pareto chart is a tool for statistical process control among a large set of quality factors. I’ll define large as statistically significant.
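For concreteness, a Pareto chart is just the issue counts sorted in descending order with a cumulative-percentage line overlaid, and it can be produced with the same Matplotlib tooling I used in the graphical methods posts. A minimal sketch using made-up retrospective categories:

import matplotlib.pyplot as plt

# Hypothetical retrospective issue counts, purely for illustration.
issues = {'unclear requirements': 14, 'flaky tests': 9, 'late reviews': 5, 'tooling': 2}
labels = sorted(issues, key=issues.get, reverse=True)
counts = [issues[label] for label in labels]

# Cumulative percentage of all issues covered as each category is added.
cumulative = []
running = 0
for count in counts:
    running += count
    cumulative.append(100.0 * running / sum(counts))

positions = range(len(counts))

fig, ax1 = plt.subplots(tight_layout=True)
ax1.bar(positions, counts, color='k', tick_label=labels)
ax1.set_ylabel('Frequency')

ax2 = ax1.twinx()
ax2.plot(positions, cumulative, color='k', marker='o')
ax2.set_ylabel('Cumulative Percentage')

fig.savefig('retrospective_pareto.png', format='png')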

The debate I entered into was whether immediate correction or a Pareto chart was the right way to improve quality. The argument I made was that, because of the way the Pareto chart is structured, all I have to do is monitor the issues that come in through retrospective and track the trends. (I was monitoring issues and had the data to prove it.) There is no difference here, except that I don't wait for a period of time before acting on the information.

Put simply, if the same issues keep coming up in retrospective, then they will get actioned. The more frequently they come up, the more frequently they get actioned. All the Pareto will show is that we actioned the correct issues.

So I ask you, gentle reader, what’s the difference?

Well, there is one difference. The timing of events could skew their observed frequency, so in theory I could act on issues that are actually less important than they appear. The key issue here is the time frame for data collection. And guess what? A Pareto chart built over the same time frame will show the same thing.

Well, there is a second difference. You can argue that you need a statistically significant number of samples in order to identify the correct action to focus on. I can buy that, but then isn’t the argument that something is better than nothing while you build up this sample? That’s an argument for small wins.

So I'm back to where I started. There is nothing wrong with the Scrum Guide's recommendation to action retrospective items as they come up. Tracking the data as it comes in through retrospective is a great idea. Not being able to generate a statistically significant data set isn't a problem if you act on what you have intelligently. Take advantage of small wins.

May 5, 2021

Agile Release Trains

  —A look at Agile Release Trains.

A colleague recently provided me with a link to Agile Release Trains (ARTs). It's not an entirely new concept to me, but it is one that I've never had the opportunity to put into practice. ARTs are part of SAFe.

I view the release train model as separating scope from timeline–the idea that if your scope misses the nth train, it can go into the (n+1)th train. It's a simple idea with possibly profound implications. SAFe uses ARTs as a key delivery vehicle. SAFe is much, much broader in scope.

To properly understand SAFe, take a look at the SAFe Principles.

A claim in Agile Release Trains is that ARTs are cross-functional. ARTs have all the capabilities—software, hardware, firmware—needed to define, implement, test, deploy, release, and where applicable, operate solutions.

ARTs operate on a set of common principles.

  • Follow a fixed schedule determined by the program increment cadence. All agile teams use the same cadence, and a program increment has a fixed length, typically 8 to 12 weeks.
  • Deliver a system demo of each system increment every two weeks. A final system demo occurs at the end of each program increment and includes the inspect and adapt activity.
  • Have a known velocity for the program increment.
  • Rely upon agile teams.
  • Rely upon dedicated people who support the ART, regardless of their functional reporting structure.
  • Rely upon face-to-face program increment planning.
  • Provide a buffer for innovation and planning, including dedicated time for
    • program increment planning,
    • innovation,
    • continuing education and
    • infrastructure work.
  • Develop on cadence and release on demand, where demand is driven by a customer. ARTs apply cadence and synchronization to address the variability of Research and Development. Releasing is typically decoupled from the development cadence.

These principles interplay like this.

[Figure: Agile Release Trains]

ARTs have some parallels with Scrum. Indeed, teams involved in ARTs may use Scrum.

The question is: can ARTs be accomplished without SAFe? The short answer is that they cannot.

The starting point for answering that question is Essential SAFe. You cannot separate ARTs from SAFe because ARTs are one of the key ingredients of SAFe. Those ingredients are:

  1. Lean–Agile Principles
  2. Real Agile Teams and Trains
  3. Cadence and Synchronization
  4. Program Increment Planning
  5. DevOps and Releasability
  6. System Demo
  7. Inspect and Adapt
  8. Architectural Runway
  9. Innovation and Planning Iteration
  10. Lean–Agile Leadership

If you want to use ARTs without SAFe, you are really interested in a method like Scrum, Extreme Programming, or Kanban. The rationale for using ARTs is that they scale, and that's the whole point of SAFe: developing enterprise-scale solutions. The methods listed focus on individual teams.