September 27, 2021

Graphical Methods for Statistical Analysis (Part II)

In Graphical Methods, I described a problem I was having with an implementation. Basically, I was struggling to find the right abstraction for a collection of functions that looked like this:

@gm.command()
@click.argument('column')
@click.pass_context
def box_plot(ctx, column):

    fig, ax = plt.subplots(1, 1, figsize = FIGURE_SIZE, tight_layout = True)

    ax.set_title('Box Plot')
    ax.set_xlabel(column)
    ax.set_ylabel('Data Set')

    ax.boxplot(ctx.obj['data frame'][column], vert=False)

    doWritePngFile(ctx.obj['output file'] + '_' + 'box_plot', doSaveFigure(fig))

I ended up with the following:

@gm.command()
@click.argument('column')
@click.argument('output', type=click.File('wb'))
@click.option('--format', type=click.Choice([ 'png', ], case_sensitive = False), default = 'png')
@click.pass_context
def box_plot(ctx, column, output, format):
    """Create a box plot from a COLUMN of data in the CSV-FILE.

    COLUMN is the name of a single column in the CSV-FILE. This column
    is used in the univariate plot.

    OUTPUT is the name of the output file.

    FORMAT is the type of output format.
    """

    plot = BoxPlot(figsize = FIGURE_SIZE, tight_layout = True)
    plot.plot(ctx.obj['data frame'], column)
    plot.save(output, format)

This still follows the setup, plot, and write pattern that I liked from the last post, but introduces BoxPlot to manage the box plot’s lifecycle. The BoxPlot class looks like this:

import matplotlib.pyplot as plt

class BoxPlot(object):
    def __init__(self, **kwargs):
        """Construct the box plot object."""
        self._figure, self._axis = plt.subplots(**kwargs)

    def __del__(self):
        """Destroy the figure once it's finished being used."""
        plt.close(self._figure)

    def plot(self, data_set, column_name):
        """Create a univariate box plot of the data set's column."""
        self._axis.set_title('Box Plot')
        self._axis.set_ylabel('Data Set')
        self._axis.set_xlabel(column_name)
        self._axis.boxplot(data_set[column_name], vert=False)

    def save(self, file_pointer, file_format: str):
        """Write the box plot to the specified file or file-like object."""
        # Save this object's own figure, not whichever figure pyplot treats as current.
        self._figure.savefig(file_pointer, format = file_format)

What works for me here is that the box plot has become the focus of the domain, not the Matplotlib figure and axis. Identifying the domain is the same challenge I described in Better Class Design. That’s the hard part of getting this correct: finding the right concept in the domain.

I still have a lot of duplicate code: the __init__(), __del__() and save() methods are the same for all of my plot classes. The next refactor will clean that up. Every function like box_plot() and every class like BoxPlot follows the same pattern. This tells me there are still a couple of abstractions missing from the implementation, but that I’m on the right track.
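One way that next refactor might go is to pull the shared lifecycle methods into a base class. This is only a sketch of the idea, not the refactor itself: the Plot base class name is hypothetical, the Agg backend is selected just so the example runs without a display, and save() calls savefig() on the figure the object owns.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt


class Plot:
    """Hypothetical base class that owns the figure lifecycle."""

    def __init__(self, **kwargs):
        """Construct the figure and axis shared by every plot type."""
        self._figure, self._axis = plt.subplots(**kwargs)

    def __del__(self):
        """Destroy the figure once it's finished being used."""
        figure = getattr(self, "_figure", None)
        if figure is not None:
            plt.close(figure)

    def save(self, file_pointer, file_format: str):
        """Write the plot to the specified file or file-like object."""
        self._figure.savefig(file_pointer, format=file_format)


class BoxPlot(Plot):
    """Only the plot-specific behaviour remains in the subclass."""

    def plot(self, data_set, column_name):
        """Create a univariate box plot of the data set's column."""
        self._axis.set_title('Box Plot')
        self._axis.set_ylabel('Data Set')
        self._axis.set_xlabel(column_name)
        self._axis.boxplot(data_set[column_name], vert=False)


# Usage: a plain dict stands in for the data frame here.
buffer = io.BytesIO()
box_plot = BoxPlot(figsize=(4, 3), tight_layout=True)
box_plot.plot({'height': [1.0, 2.0, 3.0, 4.0, 10.0]}, 'height')
box_plot.save(buffer, 'png')
```

With this shape, each new plot class would only need to supply its own plot() method.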

Where this last refactor has paid off is the elimination of doWritePngFile(). There was an excessive amount of layering there that was overly focused on writing to just a file. The save() API is much more general, taking only a file-like object and a format string supported by Matplotlib’s savefig(). This was achieved at the expense of pushing the format parameter through to the command line.
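To illustrate why taking a file-like object is more general, here is a small sketch that writes a box plot into an in-memory buffer instead of a file on disk. The render() helper and the sample values are made up for the example, and the Agg backend is chosen only so it runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for the example
import matplotlib.pyplot as plt


def render(values, file_pointer, file_format="png"):
    """Hypothetical helper: draw a box plot and write it to file_pointer."""
    figure, axis = plt.subplots()
    try:
        axis.boxplot(values, vert=False)
        # savefig() accepts a path or any binary file-like object.
        figure.savefig(file_pointer, format=file_format)
    finally:
        plt.close(figure)


# The same call works for an open file, a socket wrapper, or a buffer.
buffer = io.BytesIO()
render([1.0, 2.0, 3.0, 4.0, 10.0], buffer)
png_bytes = buffer.getvalue()
```

Nothing in the plotting code needs to know where the bytes end up, which is what made the extra file-writing layer unnecessary.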

Another important point is that this refactor moved me away from the recurring problem of the Data Class, which is another indication that this last refactor was a positive step forward.
