November 24, 2021

Graphical Methods for Statistical Analysis (Part III)

In Graphical Methods, I described an improvement I made to an implementation and showed how it was on it’s way to being better.

I ended up with the following:

import matplotlib.pyplot as plt

class Plot(object):
    def __init__(self, **kwargs):
        """Construct the plot object."""
        self._figure, self._axis = plt.subplots(**kwargs)

    def __del__(self):
        """Destroy the figure once it's finished being used."""
        plt.close(self._figure)

    def plot(self, data_set, column_name):
        """Placeholder for univariate data plot."""
        pass

    def save(self, file_pointer, file_format: str):
        """Write the histogram plot to the specified file or file-like object."""
        plt.savefig(file_pointer, format = file_format)

and

from ..plot import Plot

import matplotlib.pyplot as plt

class BoxPlot(Plot):
    def __init__(self, **kwargs):
        """Construct the box plot object."""
        super().__init__(**kwargs)

    def plot(self, data_set, column_name):
        """Create a univariate box plot of the data set's column."""
        super().plot(data_set, column_name)
        self._axis.set_title('Box Plot')
        self._axis.set_ylabel('Data Set')
        self._axis.set_xlabel(column_name)
        self._axis.boxplot(data_set[column_name], vert=False)

This is the same class hierarchy I proposed in Graphical Methods for Statistical Analysis. The difference here is that the list of arguments provided to Plot is no longer focused on plot attributes (title, labels, etc). Instead it’s focused on the construction of the figure and axes required to make the plot.

The plot attributes are delegated to the derived classes (BoxPlot in this example). A nice property of this is that the BoxPlot’s plot function can change the plot attributes without affecting anything else. This implementation also has a nice property in that changing the axes (i.e., from the default of 1) doesn’t affect the clients or the parent class. Everything relating to a box plot is in this class.

This design doesn’t currently support bivariate analysis and the plot method won’t adjust well to the integration of that requirement.

About the only thing that I debate in the implementation of BoxPlot is whether the titles and labels should be set in by the plot() method. I intend to leave them there for a couple of reasons. The distance between their use and the univariate plot is small. Extending the box plots to bivariate values is going to force me to extend the implementation in some way that I haven’t fully through through.

The only other point in the implementation is the recurrence of the pattern of setup, plot and write:

@gm.command()
@click.argument('column')
@click.argument('output', type=click.File('wb'))
@click.option('--format', type=click.Choice([ 'png', ], case_sensitive = False), default = 'png')
@click.pass_context
def box_plot(ctx, column, output, format):
    """Create a box plot from a COLUMN of data in the CSV-FILE.

    COLUMN is the name of a single column in the CSV-FILE. This column
    is used in the univariate plot.

    OUTPUT is the name of the output file.

    FORMAT is the type of output format.
    """

    plot = BoxPlot(figsize = FIGURE_SIZE, tight_layout = True)
    plot.plot(ctx.obj['data frame'], column)
    plot.save(output, format)

I haven’t convinced myself that taking advantage of the Plot base class so that a method like the following is a good idea.

def plot_it(plot: Plot):
    plot.plot(...)
    plot.save(...)

It’s tempting, but the problem is that it’s too soon. Bivariate plots are coming and it’s not clear how to change the implementation to support this. Better to wait.

September 27, 2021

Graphical Methods for Statistical Analysis (Part II)

In Graphical Methods, I described a problem I was having with an implementation. Basically, I was having difficulty in finding what felt like the right abstraction for a collection of methods that looked like this:

@gm.command()
@click.argument('column')
@click.pass_context
def box_plot(ctx, column):

    fig, ax = plt.subplots(1, 1, figsize = FIGURE_SIZE, tight_layout = True)

    ax.set_title('Box Plot')
    ax.set_xlabel(column)
    ax.set_ylabel('Data Set')

    ax.boxplot(ctx.obj['data frame'][column], vert=False)

    doWritePngFile(ctx.obj['output file'] + '_' + 'box_plot', doSaveFigure(fig))

I ended up with the following:

@gm.command()
@click.argument('column')
@click.argument('output', type=click.File('wb'))
@click.option('--format', type=click.Choice([ 'png', ], case_sensitive = False), default = 'png')
@click.pass_context
def box_plot(ctx, column, output, format):
    """Create a box plot from a COLUMN of data in the CSV-FILE.

    COLUMN is the name of a single column in the CSV-FILE. This column
    is used in the univariate plot.

    OUTPUT is the name of the output file.

    FORMAT is the type of output format.
    """

    plot = BoxPlot(figsize = FIGURE_SIZE, tight_layout = True)
    plot.plot(ctx.obj['data frame'], column)
    plot.save(output, format)

This still follows the setup, plot and write the data pattern that I liked from the last post but introduces BoxPlot to manage the box plot lifecycle. The BoxPlot class looks like this:

import matplotlib.pyplot as plt

class BoxPlot(object):
    def __init__(self, **kwargs):
        """Construct the box plot object."""
        self._figure, self._axis = plt.subplots(**kwargs)

    def __del__(self):
        """Destroy the figure once it's finished being used."""
        plt.close(self._figure)

    def plot(self, data_set, column_name):
        """Create a univariate box plot of the data set's column."""
        self._axis.set_title('Box Plot')
        self._axis.set_ylabel('Data Set')
        self._axis.set_xlabel(column_name)
        self._axis.boxplot(data_set[column_name], vert=False)

    def save(self, file_pointer, file_format: str):
        """Write the box plot to the specified file or file-like object."""
        plt.savefig(file_pointer, format = file_format)

What works for me here is that the box plot has become the focus of the domain, not the Matplotlib figure and axis. Identifying the domain is the same challenge I described in Better Class Design. That’s the hard part of getting this correct–finding the right concept in the domain.

I still have a lot of duplicate code–the __init__(), __del()__ and save() methods are the same for all of my plot classes. The next refactor will clean that up. Every function like box_plot() and every class like BoxPlot follows the same pattern. This tells me there is still a couple of abstractions that are missing from the implementation, but that I’m on the right track.

Where this last refactor has paid off is the elimination of doWritePngFile(). There was an excessive amount of layering there that was overly focused on writing to just a file. The save() API is much more general, taking only a file-type object and a format string supported Matplotlib’s savefig(). This was achieved that the expense of pushing the format parameter through to the command-line.

Another important point is that this refactor moved me away from the recurring problem of the Data Class. Another indication that this last refactor was a positive step forward.

August 29, 2021

Graphical Methods for Statistical Analysis

I recently read Graphical Methods in Statistical Analysis by Lincoln E. Moses. A nice introduction that takes you through several different ways to present data.

I wrote a program that takes a look at some of the univariate methods described in the paper. In my case, I am interested in plotting a single variable from a data set–so the paper’s description of multiple samples isn’t interesting.

What came out of the development of this tool was a collection of implementation and design challenges. Basically, I wrote code and didn’t like the result. Explored the problem space a little and still didn’t like the implementation.

I settled on the following implementations.

@gm.command()
@click.argument('column')
@click.pass_context
def histogram(ctx, column):
    fig, ax1 = plt.subplots(1, 1, figsize = FIGURE_SIZE, tight_layout = True)

    ax1.set_title('Histogram')
    ax1.set_ylabel('Frequency')
    ax1.set_xlabel(column)

    ax1.hist(ctx.obj['data frame'][column], bins = calculateBins(ctx.obj['data frame'][column]), color='k', ls='solid')

    ax2 = ax1.twinx()
    ax2.set_ylabel('Relative Frequency')
    ax2.tick_params(axis='y')

    adjust_y_axis = lambda x, pos: "{:0.4f}".format((x / len(ctx.obj['data frame'][column])))
    ax2.hist(ctx.obj['data frame'][column], bins = calculateBins(ctx.obj['data frame'][column]), color='k', ls='solid')
    ax2.yaxis.set_major_formatter(tick.FuncFormatter(adjust_y_axis))

    doWritePngFile(ctx.obj['output file'] + '_' + 'histogram', doSaveFigure(fig))

and

@gm.command()
@click.argument('column')
@click.pass_context
def box_plot(ctx, column):

    fig, ax = plt.subplots(1, 1, figsize = FIGURE_SIZE, tight_layout = True)

    ax.set_title('Box Plot')
    ax.set_xlabel(column)
    ax.set_ylabel('Data Set')

    ax.boxplot(ctx.obj['data frame'][column], vert=False)

    doWritePngFile(ctx.obj['output file'] + '_' + 'box_plot', doSaveFigure(fig))

The implementation relies upon the Click and [Matplotlib][https://matplotlib.org] Python packages.

What I like about the box_plot() and histogram() functions is their form:

  • setup the figure and axes
  • plot the data
  • write the data

This pattern of setup, plot, write repeats itself in every method that generates a plot.

What I don’t like is the code duplication in these functions. Setup and write differ only by their parameters. Only plot is unique.

The challenge is to find a way to replicate the pattern and eliminate the duplication. I want to improve the code and not complicate it. I want to improve the code and enhance it’s readability.

The obvious change is:

``python def createFigure(): return plt.subplots(1, 1, figsize = FIGURE_SIZE, tight_layout = True)

def setFigureAttributes(title, x_label, y_label): ax.set_title(‘Box Plot’) ax.set_xlabel(column) ax.set_ylabel(‘Data Set’)


This eliminates the duplication.
It's the same solution I came up with for the `doWritePngFile()` function.
Still, I don't like it.

I don't like it because `createFigure()` assumes there is only ever one subplot.
If I rewrite this to:

``python
def createFigure(row, column):
    return plt.subplots(row, column, figsize = FIGURE_SIZE, tight_layout = True)

The basically, I’m writing a wrapper around plt.subplots(). Same thing I did with setFigureAttributes().

The problem in each case is that while I’m eliminating duplication I’m increasing cognitive load. In order to understand the wrapper, I need to understand what the wrapper calls. No value in this.

A refactor:

class Plot(object):
    def __init__(self, title, x_label, y_label):
        self._title = title
        ...
  
     def save(self, fig, file_name):
        doWritePngFile(file_name, doSaveFigure(fig))
         

class BoxPlot(Plot):
    def __init__(self, title, x_label, y_label):
        super().__init__(self, title, x_label, y_label))
        self._figure, self._ax = plt.subplots(...)

    def plot(self, data):
        ax.boxplot(data, vert=False)

    def save(self, file_name):
        super().save(self._figure, file_name)

This isn’t any better because of the number of fact that the Plot class contains a bunch of data that is doesn’t use. The Boxplot class would need it to set up the axes. Dumping this data back into Boxplot re-introduces the repetition of setup between different plots. If I do this, then all I have is the Plot class that could has an implementation for save() but nothing else. This approach is the wrong abstraction.

Another refactor:

class Plot(object):
    def __init__(self, figure, axes, formatter):
        self._figure = figure
        self._axes = axes
        self._formatter = formatter

     def plot(self):
        pass

     def save(name):
         formatter(name)

class BoxPlot(Plot):
    def __init__(self, formatter):
        super().__init__(plt.subplots(...))

Still no good. You can’t plot without know the type of plot and it seems silly for BoxPlot to forward a function to Plot so that it gets used there. Wrong abstraction again, although the formatter does provide some insight on how to save the plot in different formats.

Anyway, where I ended up with this is that I don’t know enough about the abstraction I’m looking for to do anything meaningful. I’m going to leave the duplication and see how it turns out.

Part of the argument for leaving this duplication is that I’m not sure if the program is going to need to use multiple subplots and how. I don’t want to trick myself into building an abstraction that gets in the way of future enhancements. This is a different look at Practical Application of DRY.

July 2, 2021

Developer Traps: Needless Layering

  —A detailed analysis of how I code provided insight on poor practice.

I started to perform a deeper analysis of my approach to coding. It’s proven insightful.

My approach to studying how I code was motivated by a talk by Kent Beck on Responsive Design. This talk is insightful overall, but what motivated my analysis was Kent’s discussion on how he prepared for his Smalltalk patterns book.

To develop the patterns in the book he spent time writing down what he was doing when he coded and providing explanations for why he was doing those things. Basically, he was trying to capture the thought process of an expert Smalltalk developer in a way that could be easily explained to others.

What struck me in this discussion, what that anyone could repeat this exercise and benefit. So I wrote down what I did when I coded with the objective of eliminating poor practice.

Here is an example.

I’m in the process of writing an application using Qt that relies upon an SQLite database. One of the challenges was that I wanted to create test cases that automatically create and remove database files.

Seems simple enough. My first attempt looks like this. (I’m using Qt and Google Test.)

class FileTemplateManager
{
public:
    FileTemplateManager(const char * const t): file_name_template(t) {
        if(!file_name_template)
           throw "file template error";
    }

    ~FileTemplateManager() {
    }

    const char *get_template() const {
        return file_name_template;
    }

private:
    const char * const file_name_template;
};

const bool fileExists(const QString& path) {
    QFileInfo check_file(path);
    if (check_file.exists() && check_file.isFile()) {
        return true;
    }
    return false;
}

/* Check container state following initialization using default constructor.
 */
TEST(TestDatabase, Create) {
    FileTemplateManager test_file_path("/tmp/database.XXXXXX");;
    QString database(test_file_path.get_template());

    create(database);
    ASSERT_TRUE(fileExists(database));
}

The point of this attempt was to ensure that I could create a database with a fixed path. That database is located at /tmp/database.XXXXXX. The format of the database path is motivated by my desire to use man 3 mktemp to generate a unique file name. I’m planning on using mktemp because I couldn’t find a generic way to generate a random file name that I could provide SQLite to create the database. (Qt has a function for this but the name isn’t generated until you open the file.)

The entire focus of the abstraction around temporary files is the management of the template string. Initially, I liked this code because it worked, it was clean and the FileTemplateManager did only one thing.

The second attempt looks like this.

class FileTemplateManager
{
public:
    FileTemplateManager(const char * const t): file_name_template(t) {
        if(!file_name_template)
           throw "file template error";
    }

    ~FileTemplateManager() {
    }

    const char *get_template() const {
        return file_name_template;
    }

private:
    const char * const file_name_template;
};

class FileManager
{
public:
    FileManager(const FileTemplateManager& t): file_name_template(t) {
        file_name = new char[strlen(file_name_template.get_template()) + 1];
        strcpy(file_name, file_name_template.get_template());
        mktemp(file_name);
    }

    const char * const get_name() const {
        return file_name;
    }

    ~FileManager() {
        if(std::remove(file_name) != 0)
            std::perror( "Error deleting file" );
        delete file_name;
    }

private:
    FileTemplateManager file_name_template;
    char *file_name;
};

const bool fileExists(const QString& path) {
    QFileInfo check_file(path);
    if (check_file.exists() && check_file.isFile()) {
        return true;
    }
    return false;
}

/* Check container state following initialization using default constructor.
 */
TEST(TestDatabase, Create) {
    FileManager test_file_path(FileTemplateManager("/tmp/database.XXXXXX"));
    QString database(test_file_path.get_name());

    create(database);
    ASSERT_TRUE(fileExists(database));
}

Introduced the FileManager abstraction to manage the creation and deletion of the file using RAII. There is some nice stuff here: FileManager is blissfully ignorant of the details of FileTemplateManager and the two play nicely together. mktemp gets introduced and the whole thing works very nicely.

The fourth attempt looks like this.

class FileManager
{
public:
    FileManager(const char * const t) {
        file_name = new char[strlen(t) + 1];
        strcpy(file_name, t);
        if(0 == strlen(mktemp(file_name))) {
            throw "cannot find a unique file name";
        }

    }

    const char * const get_name() const {
        return file_name;
    }

    ~FileManager() {
        if(std::remove(file_name) != 0)
            std::perror( "Error deleting file" );
        delete file_name;
    }

private:
    char *file_name;
};

const bool fileExists(const QString& path) {
    QFileInfo check_file(path);
    if (check_file.exists() && check_file.isFile()) {
        return true;
    }
    return false;
}

/* Check container state following initialization using default constructor.
 */
TEST(TestDatabase, Create) {
    FileManager test_file_path("/tmp/database.XXXXXX");
    QString database(test_file_path.get_name());

    create(database);
    ASSERT_TRUE(fileExists(database));
}

FileTemplateManager turned out to be complete nonsense. It’s not needed because the string it managed was corrupt or invalid only if there was programmer error. Managing this at runtime was a waste.

FileTemplateManager is also the wrong abstraction. The goal was a unique filename to use in the test so that I could ensure that the database file name was unique. It did nothing to support that.

Elminating FileTemplateManager is best because the errors that mktemp returns occur when there are no ‘X’s in the file name and when it can’t create a unique file name.

The first is a programming error and the second is a runtime error. The programming error is handled by ignoring the NULL return from mktemp whenever there are no X’s in the file name. This will be caught when you send NULL to strlen().

The second requires a check for a zero-length (empty) string. Whenever this occurs, FileManager cannot fulfill it’s contract so it throws an exception.

Nice.

A better path to this implementation would have used FileManager to return the provided file name in the first iteration and deal with mktemp in the second. A closer look at the nature of the errors returned by mktemp and their implications on the context mktemp was going to be used in might have paved the way for this.

June 3, 2021

Pareto Charts and Retrospective

  —A look at qunatifying your retrospective data.

I’ve been engaged in retrospective for multiple teams and have always run them with the same principles:

  • create an environment where people are free to discuss the good and poor outcomes.
  • review our current best practice and revise if we can identify mitigations to prevent poor outcomes.
  • drive towards an action or decision on each topic discussed.

The most recent set of changes to our retrospective are discussed in Changing Defect Management to include Root Cause. Recently, I was told that I’m doing retrospective wrong.

The argument was that we shouldn’t be fixing anything without a Pareto chart of the issues. I discuss Pareto Charts in several articles:

This article explores the relationship between the Pareto chart and a retrospective. It also considers the value of small wins and frankly, makes an argument that the current structure of the retrospective is better.

The chief characteristics of retrospective is that the team reflects on their performance over the last sprint and seeks to identify things they should

  • start,
  • stop and
  • continue doing.

A retrospective includes consideration of

  • processes,
  • people,
  • relationships and
  • tools.

A retrospective actions items immediately so corrections can begin during the next sprint.

A Pareto chart is a tool for statistical process control among a large set of quality factors. I’ll define large as statistically significant.

The debate I entered into was whether the immediate correction or a Pareto chart was the correct way to go to improve quality. The argument I made was that because of the way the Pareto is structured, all I have to do is monitor the issues that come in through retrospective and track trends. (I was monitoring issues and had the data to prove it.) There is no difference here, except that I don’t wait for a period of time to act on the information.

Put simply, if the same issues keep coming up in retrospective, then they will get actioned. The more frequently they come up, the more frequently they get actioned. All the Pareto will show is that we actioned the correct issues.

So I ask you, gentle reader, what’s the difference?

Well, there is one difference. The timing of events could skew the frequency of events so that in theory I could act on issues that are less important because they occur less frequently. The key issue here is the time frame for data collection. And guess what? The Pareto will show the same thing.

Well, there is a second difference. You can argue that you need a statistically significant number of samples in order to identify the correct action to focus on. I can buy that, but then isn’t the argument that something is better than nothing while you build up this sample? That’s an argument for small wins.

So I’m back to where I started. There is nothing wrong with the recommended actions in the Scrum Guide to action retrospective items. Tracking the data as it comes in through retrospective is a great idea. Not being able to generate a statistically significant data set isn’t a problem if you do it intelligently. Take advantage of small wins.