May 17, 2022

Don't Repeat Yourself (DRY)

  —A Domain-Based Look at Don't Repeat Yourself

In a Practical Application of DRY, I discuss what Sandi Metz had to say about code duplication and the cost of using the wrong abstraction. Recently, I exchanged a few tweets with Steve Streeting (@stevestreeting) about an observation he made about code duplication. That discussion prompted me to revisit DRY, and this is an attempt to clarify my thinking on it.

Steve challenged the wisdom of seeking a unified vision of a system to support two tasks instead of accepting some code duplication between different systems when those systems could do these tasks better.

DRY is domain based (i.e., it has a bounded context), but influences structure (i.e., it prevents code duplication). Its goal is to retain knowledge in a meaningful way. Meaningful in the sense that it is

  • singular (consists of one concept),
  • unambiguous (can be understood in relation to other concepts),
  • authoritative (is a resource for containing knowledge on the concept) and
  • representative (has an appropriate structure for the concept).

I looked at the notion of authoritative and thought a better word to use was definitive. Then I rejected definitive in favor of authoritative because:

Definitive implies complete knowledge. Authoritative is better because it implies there is a single source of this knowledge, and it supports the development of a shared resource to store, reflect upon and revise that knowledge.

DRY seeks to create an authoritative source of knowledge for a concept. That source may or may not be definitive because the information it captures may be incomplete.

So DRY fuses two concepts together: the collection of knowledge and its representation. Breaking it into its base forms, you get:

  • Single Point of Truth (SPOT) is about a singular, unambiguous and authoritative source of knowledge. It applies to the domain the knowledge comes from.

  • Once And Only Once (OAOO) is about representation. It applies structure to knowledge. Since OAOO comes into existence via refactoring it is also the process by which incremental improvement occurs.1

Importantly, SPOT is a knowledge activity that identifies and organizes information, whereas OAOO is a structural activity that incrementally improves the implementation without changing its behavior.
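To make OAOO concrete, here's a minimal sketch (my example): the duplicated computation is folded into a single representation, without changing behavior.

    # Before OAOO: the same total-with-tax computation appears twice.
    def invoice_total(items):
        return sum(price * qty for price, qty in items) * 1.10  # 10% tax

    def quote_total(items):
        return sum(price * qty for price, qty in items) * 1.10  # 10% tax

    # After OAOO: the knowledge "a total is the subtotal plus 10% tax"
    # is represented once and only once.
    def total_with_tax(items, tax_rate=0.10):
        return sum(price * qty for price, qty in items) * (1 + tax_rate)

    def invoice_total(items):
        return total_with_tax(items)

    def quote_total(items):
        return total_with_tax(items)

Note the refactor assumes invoices and quotes really do share this point of truth; whether they do is a SPOT question, not an OAOO one.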

DRY effectively organizes a domain. This raises the question: what is a domain? I think about it this way:

  • Domain code is the representation of the domain in the source code.2
  • Domain knowledge is the collection of knowledge that relies on domain experts to develop and help express it.

The two overlap. In fact, code might be reused in multiple knowledge domains. For example, the same mathematical function library might be used in a physics and a financial application. The knowledge domains for these applications are likely to differ significantly and are unlikely to be shared.

The converse is also true: the physics application will generate libraries, say on particle physics, that have no meaning in the financial application.

Back to what Steve had to say. To paraphrase, he recommends the rule of three and advises generalizing the lowest levels first, not jumping in with generalizations at the higher levels too soon (or at all). He warns that it's common to spot some common implementation details, try to force higher-level concepts together, and then watch things fall apart as they diverge.

His original statement refers to codebase, tasks and systems. So

codebase == implementation, and structuring it is what OAOO is for.

system == knowledge, and structuring it is what SPOT is for.

DRY is about organizing knowledge for a system to support a codebase.

Steve also talks about the goal of a unified vision for this system and how it’s difficult to achieve in practice. In his example, the systems are related, but cannot be unified without adding complexity and reducing the performance of the tasks. He advises against going after a unified vision in this example.

I asked Steve for insight on questions to ask that lead to better decision making. I can apply DRY to his answer, particularly if I keep SPOT and OAOO top of mind.

The divergence of higher-level concepts represents a misunderstanding of SPOT. That is, you coupled different domain concepts: you took different points of truth and munged them together. You failed to properly judge the oneness of those concepts.

The commonality of lower-level concepts represents an opportunity to execute OAOO. That is, refactor the oneness at the edges of the systems, but don't get carried away.

DRY is about creating authoritative sources of knowledge. Authoritative might be definitive, but it needn't be. The role of DRY, then, is to help guide you through knowledge acquisition and organization as you learn and as the system evolves.

If you don't know enough yet, Steve provides a heuristic (the rule of three) for how long to delay these decisions. (In reality, it's not the count that matters. It's the knowing. That's the hard part.)

So an answer on what questions to ask about code duplication in the higher-level concepts is:

  1. Ask if you know enough about the two SPOTs (concepts) containing this code.

    The emphasis is on understanding the concepts. If they differ, the code might not be duplicated at all; the similarity may just be a temporary property of the implementation (see the sketch after this list).

  2. Recognize that DRY is about creating an authoritative resource for these concepts.

    Authoritative means it's singular, but it may not be complete. Do what you can to determine whether these concepts are in fact the same.
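To make item 1 concrete, here's a minimal, hypothetical sketch (the names and rules are mine, not Steve's):

    # Two functions with identical bodies today. They look like duplication,
    # but they answer to different concepts (different SPOTs): usernames and
    # invoice references have unrelated reasons to change.

    def is_valid_username(name):
        # Concept: user identity. Today the rule is "3 to 20 characters".
        return 3 <= len(name) <= 20

    def is_valid_invoice_reference(ref):
        # Concept: billing. Today the rule happens to be "3 to 20 characters",
        # but it is set by the accounting system, not by identity rules.
        return 3 <= len(ref) <= 20

Merging these into one shared validator would munge two points of truth together; when invoice references grow to 30 characters, the shared abstraction falls apart, exactly the divergence Steve describes.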

An answer on what questions to ask about code duplication in the lowest-level concepts is to use the rule of three and apply OAOO.

Thanks to Steve Streeting for sharing his insight.

Footnotes

1: I'm using refactoring in the same way Martin Fowler uses it: no functional changes occur during a refactor, and a refactor is done only when supported by tests.

2: I’m abusing this definition of domain code. The original definition refers to code substituted out so it can be mocked. The abuse is that I’m applying this notion to the implementation itself. Importantly, OAOO changes domain code.

March 20, 2022

Estimate Bias

  —A Look at Patterns in Estimates

In Estimate Bias, I describe a method for generating buffer in a project plan using the ratio of time spent to original estimate for similar projects. Here, I look at estimate bias by focusing on trends in the original estimates.

The following plots depict the original estimates for multiple projects. The plots on the left are histograms. Those on the right are sorted, with their \(\log_{10}\) values plotted.

These plots use the original estimate for each task. If you are familiar with JIRA, this is not the sum of the original estimates. The sum hides important information about how the estimates are created. If I used the sum, the plateaus in the right column would be less pronounced or nonexistent.
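As a rough sketch of how these paired plots are produced (the estimates here are made up, and the plotting choices are mine):

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical per-task original estimates, in days.
    estimates = np.array([0.5, 0.5, 1, 1, 1, 2, 2, 2, 2, 5, 5, 5, 8, 10])

    fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

    # Left: histogram of the raw estimates.
    left.hist(estimates, bins=20)
    left.set_xlabel("original estimate (days)")
    left.set_ylabel("count")

    # Right: estimates sorted ascending, log10 values plotted. Plateaus
    # show up as flat runs: many tasks sharing the same estimate.
    right.plot(np.log10(np.sort(estimates)), marker=".")
    right.set_xlabel("task (sorted by estimate)")
    right.set_ylabel("log10(estimate)")

    plt.tight_layout()
    plt.show()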

Is there a difference between estimating a 10-day task and two 5-day sub-tasks? I'll argue yes. Intuitively, it seems likely that better estimates are possible for smaller tasks, but not always.

What do these plots tell me about the quality of our estimates? They tell me there is a tendency to gravitate to a few values \(> 0\) and \(\le 10\) days. This range is closer to 0.3 days through to 8 days, which is evidenced in the histograms.

This trend gets even more pronounced when I plot all of the data from all projects, as shown in the plot below.

Here, it looks like the dominant plateaus are 0.5, 1, 2 and 5 days.

So these estimates cluster in plateaus that fall on durations within a working week. Individual projects tend to have the same plateaus, and these fall into the 0.5, 1, 2 and 5 day ranges.

February 19, 2022

Patch Size

  —Some Experience with Code Patch Size

I've had some funny experiences, across a couple of positions I've held, with developers and their stance on the size of commits (patches). For me, code review is one part quality activity and another part knowledge share. I prefer to frame code review as a decision on whether the patch is suitable for use in the product or not.

Most of my career has been spent working on systems software: first a cryptographic toolkit, then compilers. Several years ago I transitioned to application development: end-user applications where the end user is an external party. Two different groups of application developers have made similar arguments against small patches. I've never encountered this problem with systems developers.

The ask is always this: code needs to be reviewed prior to merging to master, and your reviewers need to be able to review it thoroughly. There is always someone who insists they can't make small patches, and the arguments never vary:

  • The patch can’t be done differently.
  • The size of the patch doesn’t matter.

The first argument is nonsense, and I attribute it to laziness in one case and a lack of understanding of the purpose of code review in the other. I direct these people to kernel.org (https://www.kernel.org) and ask them to tell me what they see. (Those patches are really nice, by the way.)

The second argument is only partially correct. It reflects a lack of empathy towards your reviewers.

Size is a factor when complexity comes into play. It's a factor because large changes increase the odds of errors being introduced: whitespace noise, the spot where the committer just couldn't resist making one more simple change, or changes simply so large that the reviewers give up.

What's curious is the stark contrast between the approaches taken by application and systems developers. Application developers seem to take more risks in the changes they make and tend to do less automated testing. I don't have an answer for why, but it's piqued my interest.

January 21, 2022

Estimate Bias

  —A Look at Error in Estimates

I’ve been spending a lot of time on project management lately. Most of the focus is on building plans that achieve consistent delivery and known quality. I say consistent delivery because this implies predictable outcomes.

I differentiate between consistent and optimal delivery; the argument is similar to learning how to walk before you run. I'm just discussing walking.

One way of developing consistent delivery is to add buffer to your schedules. It's not good practice to buffer arbitrarily. Buffer should reflect factual information about the project you are running.

If I buffer, I do so transparently. This allows people to challenge the buffer.

This post is about an element of buffer I'll call estimate bias. Estimate bias is the ratio of the sum of time spent to the sum of original estimates for a similar project. A similar project ideally uses the same people, code base, technology and tools.
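As a formula (the notation is mine):

\[
\text{bias} = \frac{\sum_i \text{timeSpent}_i}{\sum_i \text{originalEstimate}_i}
\]

where \(i\) ranges over the tasks of the similar project.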

The estimate bias is a multiplier for estimates on the new project: if the bias is 2 and the estimate is 5, then I create a loaded estimate of 10. It adds a lot of time to a project.

Why do I create a bias?

I love arguments about creating accurate estimates. An estimate is just a guess. A highly educated guess, but a guess nonetheless. Bias is a measure of error in estimates.

I've written elsewhere about no estimates. I'd liken bias to a heuristic that sits in the balance between no estimates and a prediction of a completion time. I use it in an environment firmly embedded in cost vs time.

Let's get back to the bias: the sum of time spent divided by the sum of original estimates. I've struggled with this definition. I've settled on the notion that it is best expressed as the number of units of time spent for each unit of the original estimate.

I’ve had some people suggest this is a work ratio (time spent divided by original estimate). It is not. I agree that work ratio is a valid measure for a single task, but it’s not a good measure to apply to an entire project.

To use work ratio, you'd have to use an average of ratios. Take a moment to look at these plots of data.

Scatter Plot of Work Ratio

In this example, the outliers need to remain (because they reflect real events). There needs to be a balanced way to manage the outliers with the majority of events. The average of ratios is too sensitive to outliers.

A geometric mean doesn’t seem appropriate because the relationship between time spent on different tickets is not multiplicative. It’s additive.

This leaves a weighted average and the definition I describe above. The problem with the weighted average: what is a sensible weight to give to the ratio of 86 on this graph? I don't even want to think about that.

Let me describe why I think that the sum of time spent divided by the sum of original estimates is a reasonable heuristic. Let's say I have two tasks as follows.

Ticket   Time Spent   Original Estimate   Work Ratio
T1                1                1000        0.001
T2             1000                   1     1000
Total          1001                1001      500.0005 (average of ratios)

This looks like a scary project, but it ended on time. An average of ratios would have me cost this project at 500.0005 units of time for every unit estimated, when in fact it cost exactly what was predicted. Bias would be 1001/1001 = 1, which accurately reflects the error in the estimate for the project. (Clearly we have a problem estimating, but that's a ticket-level problem that the work ratio correctly identifies.)
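A quick sketch of the arithmetic (the variable names are mine):

    # Compare bias (ratio of sums) to the average of work ratios,
    # using the T1/T2 example above.
    tasks = [
        # (time_spent, original_estimate)
        (1, 1000),   # T1: wildly overestimated
        (1000, 1),   # T2: wildly underestimated
    ]

    spent = sum(s for s, _ in tasks)
    estimated = sum(e for _, e in tasks)

    bias = spent / estimated                                # 1001 / 1001 = 1.0
    mean_ratio = sum(s / e for s, e in tasks) / len(tasks)  # 500.0005

    print(bias)        # 1.0: the project landed on its estimate
    print(mean_ratio)  # 500.0005: dominated by the outliers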

Is bias a good idea? It is if the project, team and resources are similar. Similarity is of course an open question.

December 23, 2021

Object-Oriented Programming

  —A Review of the Book of the Same Name

I recently read the book Object-Oriented Programming, by Brad Cox. It does a really nice job of explaining object-oriented programming and it’s well worth the read.

The main takeaways I got from the book were:

  • focus on the client-supplier relationship and make the client’s life as simple as you can.
  • use encapsulation to restrict the effect of change.
  • use dynamic binding to ensure extensibility of the system over time.

Two of the three ideas seem pretty standard now (this is an old book), but the client-supplier relationship really resonated with me. It resonated because it's a different perspective on interfaces, which I think are really hard to get correct. If you use SOLID, for example, there isn't much guidance on what makes a good interface.

When I think of SOLID, the interface is most explicit in the Open-Closed and Liskov Substitution principles. The other principles encourage encapsulation but are very general.

For example, Single Responsibility (SRP) is a property of a good class, but it doesn't say anything about the class API. Client-supplier complements SRP nicely because it forces you to extend your thinking beyond the responsibility to how the class will be used and, ultimately, what the class methods should look like.
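A toy sketch of what I mean (my example, not Cox's): the class keeps a single responsibility, but its API is shaped by what the client needs to do rather than by what the class happens to contain.

    class Report:
        # Single responsibility: rendering a report.
        def __init__(self, title, rows):
            self._title = title
            self._rows = rows

        # The client's job is "show me the report", so the API is one
        # call, not a parade of getters the client assembles itself.
        def render(self):
            lines = [self._title]
            lines += [f"{name}: {value}" for name, value in self._rows]
            return "\n".join(lines)

    print(Report("Q1", [("revenue", 1200.0)]).render())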

The bulk of the book focuses on Objective-C and discusses Smalltalk. The explanations of Smalltalk that focus on messages and methods really emphasize that messages and methods are actions. It's a nice reminder of what you are trying to achieve by creating them in the first place.