August 29, 2021

Graphical Methods for Statistical Analysis

I recently read Graphical Methods in Statistical Analysis by Lincoln E. Moses. A nice introduction that takes you through several different ways to present data.

I wrote a program that takes a look at some of the univariate methods described in the paper. In my case, I am interested in plotting a single variable from a data set–so the paper’s description of multiple samples isn’t interesting.

What came out of the development of this tool was a collection of implementation and design challenges. Basically, I wrote code and didn’t like the result. Explored the problem space a little and still didn’t like the implementation.

I settled on the following implementations.

@gm.command()
@click.argument('column')
@click.pass_context
def histogram(ctx, column):
    fig, ax1 = plt.subplots(1, 1, figsize = FIGURE_SIZE, tight_layout = True)

    ax1.set_title('Histogram')
    ax1.set_ylabel('Frequency')
    ax1.set_xlabel(column)

    ax1.hist(ctx.obj['data frame'][column], bins = calculateBins(ctx.obj['data frame'][column]), color='k', ls='solid')

    ax2 = ax1.twinx()
    ax2.set_ylabel('Relative Frequency')
    ax2.tick_params(axis='y')

    adjust_y_axis = lambda x, pos: "{:0.4f}".format((x / len(ctx.obj['data frame'][column])))
    ax2.hist(ctx.obj['data frame'][column], bins = calculateBins(ctx.obj['data frame'][column]), color='k', ls='solid')
    ax2.yaxis.set_major_formatter(tick.FuncFormatter(adjust_y_axis))

    doWritePngFile(ctx.obj['output file'] + '_' + 'histogram', doSaveFigure(fig))

and

@gm.command()
@click.argument('column')
@click.pass_context
def box_plot(ctx, column):

    fig, ax = plt.subplots(1, 1, figsize = FIGURE_SIZE, tight_layout = True)

    ax.set_title('Box Plot')
    ax.set_xlabel(column)
    ax.set_ylabel('Data Set')

    ax.boxplot(ctx.obj['data frame'][column], vert=False)

    doWritePngFile(ctx.obj['output file'] + '_' + 'box_plot', doSaveFigure(fig))

The implementation relies upon the Click and [Matplotlib](https://matplotlib.org) Python packages.
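The histogram() command calls calculateBins(), whose body isn't shown here. As a rough idea of what such a helper could look like, here is a minimal sketch that assumes Sturges' rule. The name `sturgesBinCount` and the choice of rule are illustrative assumptions, not the tool's actual code.

```python
import math


def sturgesBinCount(values):
    """Hypothetical stand-in for calculateBins(): bin count by Sturges' rule.

    Sturges' rule suggests ceil(log2(n)) + 1 bins for n observations.
    """
    n = len(values)
    if n == 0:
        # Nothing to bin; fall back to a single bin so the caller
        # always receives a positive integer.
        return 1
    return math.ceil(math.log2(n)) + 1
```

The result is a plain integer, which is one of the forms Matplotlib's `hist()` accepts for its `bins` argument. For a column of 100 values this gives 8 bins.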

What I like about the box_plot() and histogram() functions is their form:

1. set up the figure and axes
2. plot the data
3. write the data

This pattern of setup, plot, write repeats itself in every method that generates a plot.

What I don’t like is the code duplication in these functions. Setup and write differ only by their parameters. Only plot is unique.

The challenge is to find a way to replicate the pattern and eliminate the duplication. I want to improve the code and enhance its readability, not complicate it.

The obvious change is:

```python
def createFigure():
    return plt.subplots(1, 1, figsize = FIGURE_SIZE, tight_layout = True)

def setFigureAttributes(ax, title, x_label, y_label):
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
```

This eliminates the duplication.
It's the same solution I came up with for the `doWritePngFile()` function.
Still, I don't like it.

I don't like it because `createFigure()` assumes there is only ever one subplot.
If I rewrite this to:

```python
def createFigure(row, column):
    return plt.subplots(row, column, figsize = FIGURE_SIZE, tight_layout = True)
```

Then basically, I’m writing a wrapper around plt.subplots(). Same thing I did with setFigureAttributes().

The problem in each case is that while I’m eliminating duplication I’m increasing cognitive load. In order to understand the wrapper, I need to understand what the wrapper calls. No value in this.

A refactor:

class Plot(object):
    def __init__(self, title, x_label, y_label):
        self._title = title
        ...

    def save(self, fig, file_name):
        doWritePngFile(file_name, doSaveFigure(fig))


class BoxPlot(Plot):
    def __init__(self, title, x_label, y_label):
        super().__init__(title, x_label, y_label)
        self._figure, self._ax = plt.subplots(...)

    def plot(self, data):
        self._ax.boxplot(data, vert=False)

    def save(self, file_name):
        super().save(self._figure, file_name)

This isn’t any better because the Plot class contains a bunch of data that it doesn’t use. The BoxPlot class would need it to set up the axes. Dumping this data back into BoxPlot re-introduces the repetition of setup between different plots. If I do this, then all I have is a Plot class that has an implementation for save() but nothing else. This approach is the wrong abstraction.

Another refactor:

class Plot(object):
    def __init__(self, figure, axes, formatter):
        self._figure = figure
        self._axes = axes
        self._formatter = formatter

    def plot(self):
        pass

    def save(self, name):
        self._formatter(name)

class BoxPlot(Plot):
    def __init__(self, formatter):
        figure, axes = plt.subplots(...)
        super().__init__(figure, axes, formatter)

Still no good. You can’t plot without knowing the type of plot, and it seems silly for BoxPlot to forward a function to Plot just so that it gets used there. Wrong abstraction again, although the formatter does provide some insight on how to save the plot in different formats.
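To make the formatter idea a little more concrete, here is a minimal sketch of a function that builds a per-format save callable. The names `makeFormatter` and `SUPPORTED_FORMATS` are hypothetical and not part of the tool. The only Matplotlib call assumed is `Figure.savefig()` with its `format` argument.

```python
from pathlib import Path

# Hypothetical list of output formats this sketch allows; each is a
# format name that Matplotlib's Figure.savefig() understands.
SUPPORTED_FORMATS = ("png", "pdf", "svg")


def makeFormatter(figure, fmt):
    """Return a callable that saves `figure` to '<name>.<fmt>'."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")

    def formatter(name):
        # Build the output path from the base name and the chosen format.
        path = Path(f"{name}.{fmt}")
        figure.savefig(path, format=fmt)
        return path

    return formatter
```

A command could build the formatter once, for example `save = makeFormatter(fig, "svg")`, and then call `save(ctx.obj['output file'] + '_box_plot')`. Whether that belongs inside a Plot class is exactly the open question here.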

Anyway, where I ended up with this is that I don’t know enough about the abstraction I’m looking for to do anything meaningful. I’m going to leave the duplication and see how it turns out.

Part of the argument for leaving this duplication is that I’m not sure if the program is going to need to use multiple subplots and how. I don’t want to trick myself into building an abstraction that gets in the way of future enhancements. This is a different look at Practical Application of DRY.