How to solve problems instead of symptoms

If I had to pick the one skill that impresses less experienced developers the most, I would go with my ability to figure out the root cause of a problem. At first, people thought that I knew all the code in our codebase by heart and therefore always knew what was going on. But I can do this even when I have never seen or worked with the code that breaks, which disproved that theory. If my memory is not the driving factor behind this skill, then what is?

TL;DR

I can figure out problems faster than others because I

make sure that I can consistently reproduce a problem before I start looking for the source,
know that the place where the problem surfaces is not necessarily where the problem should be fixed, and
do not guess but form hypotheses and constantly try to disprove them.

Can you reproduce this?

"Can you show me your reproduction steps?" will most likely be my first question if you ask me to help you figure out a problem. Why? If you can't consistently reproduce a problem, there is a good chance you don't understand it well enough yet. And if you don't understand what causes the problem, it becomes much harder to isolate the root cause or to come up with a fix.

With reproduction steps, you are making sure that

you know enough about the context of a problem to replicate it
reviewers have the means to verify whether your fix works
you already have a rough idea about what a test case might look like
you know which components are part of the problem and which definitely aren't

It might seem cumbersome, but every minute you spend on understanding the problem before you start fixing it is well spent. If you skip this exercise, it becomes harder to verify whether the actions you're taking have any impact.

The error isn't where the code breaks

The next trap developers sometimes run into is "fixing" the problem where the code breaks. This is sometimes also referred to as "fixing a symptom" and not "fixing the root cause". Let's look at an example.

function fullName(user) {
  return `${user.firstName} ${user.lastName}`
}

fullName(null)

This code throws a TypeError because properties such as firstName cannot be read from null. I would even claim that almost everyone would jump to the conclusion that fullName should not be called with null but solely with valid user objects¹.

Real-world applications often aren't as simple as this code snippet. As applications grow more complex, obvious errors like this one become harder to spot. If we extend the example a little and add a component that uses the fullName function, it already becomes trickier.

import { fullName } from "./userUtils"

function UserProfile({ id }) {
  const [, user] = useFetchUser(id)

  return <>Hello {fullName(user)}</>
}

With the added complexity we need to take a closer look to determine which part isn't behaving correctly. My assumption is that this is what can put less experienced developers off. The error you see in your browser will still happen inside the fullName function. A fix that I've seen a lot in these situations is to change the code so that it no longer throws in this particular situation.

function fullName(user) {
  if (user == null) {
    return ""
  }

  return `${user.firstName} ${user.lastName}`
}

Admittedly, the application won't break anymore. What's the issue then? With this solution we haven't fixed the issue; we have hidden it. The problem is still there. It just no longer surfaces at this place.
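
To see what "hidden" means in practice, here is a minimal sketch that calls the guarded function the way the component would when no user could be loaded. The greeting helper is a hypothetical stand-in for the JSX in UserProfile, used only so that the snippet runs without React.

```javascript
// The guarded version from above: null no longer throws.
function fullName(user) {
  if (user == null) {
    return ""
  }

  return `${user.firstName} ${user.lastName}`
}

// Hypothetical stand-in for the JSX in UserProfile (assumption for illustration).
function greeting(user) {
  return `Hello ${fullName(user)}`
}

// A loaded user renders as expected.
console.log(JSON.stringify(greeting({ firstName: "Ada", lastName: "Lovelace" }))) // "Hello Ada Lovelace"

// A failed load no longer crashes, but the page now greets nobody
// and nothing indicates that something went wrong.
console.log(JSON.stringify(greeting(null))) // "Hello "
```

The crash is gone, but so is the only signal that the user could not be loaded.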

How can we find the correct spot for the fix?

I would argue that this is where proper reproduction steps help you a lot. For this particular bug the repro steps could be:

Open the application
Simulate a network outage (set browser to "Offline" in the developer tools)
Open the user profile

Expected: The user profile tells me it cannot load.

Observed: The page crashes.

Why does this help? Because it gives us context! We know where the error surfaces (i.e. in the fullName function) and when it happens (when there is no internet connection). While debugging, we can ignore code paths that either have nothing to do with network connections or do not involve fullName. This can drastically reduce the number of code paths that might contain the problem.

Do you now spot something when you have another look at the component I showed you earlier?

import { fullName } from "./userUtils"

function UserProfile({ id }) {
  const [, user] = useFetchUser(id)

  return <>Hello {fullName(user)}</>
}

Did you notice that the useFetchUser hook returns a tuple and that we're ignoring the first part of it? Since useFetchUser sounds a lot like network interaction, and we know that the network has something to do with our problem, I would dig deeper into that hook.

// Note: this is not a real hook and you can't implement it this way.
// To keep the example simple, let's assume the code below
// synchronously fetches a user in a way that works nicely with React.
function useFetchUser(id) {
  try {
    const user = syncFetch(`/user/${id}`)

    return [null, user]
  } catch (error) {
    return [error, null]
  }
}

As it turns out, the part of the tuple that our code ignores indicates whether an error happened while fetching the user. We've found the actual problem! What was missing is not a null check in the fullName function but proper error handling in the UserProfile component.

import { fullName } from "./userUtils"

function UserProfile({ id }) {
  const [error, user] = useFetchUser(id)

  if (error) {
    return <>Could not load the user!</>
  }

  return <>Hello {fullName(user)}</>
}
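
With the fix in place, the reproduction steps from earlier can double as a regression check. The sketch below is a deliberate simplification: renderUserProfile, fetchUser, and offlineFetch are hypothetical plain-function stand-ins for the component, the hook, and the browser's "Offline" setting, so that the snippet runs without React or a network.

```javascript
function fullName(user) {
  return `${user.firstName} ${user.lastName}`
}

// Hypothetical stand-in for the "Offline" setting: every request fails.
function offlineFetch(url) {
  throw new Error(`Network request failed: ${url}`)
}

// Hypothetical stand-in for useFetchUser, returning the same [error, user] tuple.
function fetchUser(id, fetchImpl) {
  try {
    return [null, fetchImpl(`/user/${id}`)]
  } catch (error) {
    return [error, null]
  }
}

// Hypothetical stand-in for the fixed UserProfile component.
function renderUserProfile(id, fetchImpl) {
  const [error, user] = fetchUser(id, fetchImpl)

  if (error) {
    return "Could not load the user!"
  }

  return `Hello ${fullName(user)}`
}

// The reproduction steps as a check: go offline, open the profile,
// and expect a message instead of a crash.
const output = renderUserProfile(42, offlineFetch)
console.log(output) // Could not load the user!
```

In a real project the same idea would live in whatever test setup the codebase already uses; the point is only that good reproduction steps already describe the test case.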

My advice is to always think about how sensible the guards are that you add to your components or functions. In this example, I would have asked "How much sense does it make that fullName handles null as an input?". If there is a logical reason for that, then this is totally fine. Your fix might be in the right place! But if it does not make much sense, take a step back and ask whether you are fixing a symptom and not the problem itself. Follow the path from what the user sees to where the code broke and ask the same question at every level. Repeat until you've found the place that both lies on the code path that breaks and handles the part of the behavior that you know from the reproduction steps.

The 5 Whys

I almost forgot to include this method because once you get used to doing this it becomes muscle memory. The 5 whys is a method of literally asking "Why?" five times in a row. This may sound silly but it helps to get to the bottom of a problem in a lot of cases. In our example this could look like:

Why does fullName break? Because user is null.
Why is user null? Because it is passed as a null reference from UserProfile.
Why does UserProfile pass it as a null reference? Because network errors aren't handled properly.

In this case, the third why led us to where the problem is. This method, by the way, does not only work for finding problems in code.

Don't guess, experiment

A thing that I don't like about posts like this one is the oversimplification. No real application looks like the example I've shown you in the previous section. The advice might sound easy to apply, but in your daily work it can be much harder.

We're now going to have a look at how you can test whether you are on the right path to finding the solution or not.

If you called me to your desk, I would probably ask you "What have you tried so far?" and follow up with "What did you learn?".

The first question should be relatively easy to answer. You probably tried a lot of things. Comment out a piece of code here. Change an if-statement there.

But if you can answer the second question only with "None of what I tried fixes the problem" then you've merely been guessing around and this is not likely to get you any closer to a solution. Guessing is the brute-force method to problem-solving². If you try long enough you will, eventually, find a solution that works. However, if you want to become better (and faster) at problem-solving then you need to learn how to experiment.

I can't claim that I invented this technique. This is how scientists work all day³. The two main ingredients for a good experiment are:

a hypothesis and
an experiment to disprove the hypothesis

It must be possible to disprove any hypothesis you come up with because otherwise, you'll never know whether it's false. Notice how this is not aimed at finding the right hypothesis but getting rid of the false ones.

Imagine we got a bug report that states that user avatars are not rendered with the correct colors. You have found the Avatar component and see that it uses another component called ColorfulBox. I would suggest that before you start digging into the internals of the ColorfulBox component (which might take considerable time) you run an experiment.

import { ColorfulBox } from "design-system"

function Avatar({ user }) {
  return <ColorfulBox color={user.color} />
}

Hypothesis: The ColorfulBox component has no flaw. If I change the color prop to red I expect to see a red box.

This is a good hypothesis because it helps us to eliminate a possible source of the problem. It can also be disproven. If we set the color prop to red and the box does not turn red, the hypothesis is disproven and we know we need to dig into ColorfulBox. But if it does turn red, we can rule ColorfulBox out and look for the problem elsewhere.
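
The experiment itself can be tiny. In the sketch below, ColorfulBox is a hypothetical stand-in that reports the style it would apply instead of rendering anything (the real component lives in the design system and its internals are unknown here), and runExperiment is an illustrative helper, not part of any library.

```javascript
// Hypothetical stand-in for the design-system component (assumption):
// it reports the style it would apply instead of rendering.
function ColorfulBox({ color }) {
  return { backgroundColor: color }
}

// The experiment: bypass Avatar and the user data, pass a known value,
// and compare the observation with what the hypothesis predicts.
function runExperiment(box) {
  const observed = box({ color: "red" }).backgroundColor

  return observed === "red"
    ? "ColorfulBox behaves as predicted: look at what Avatar passes in"
    : "Hypothesis disproven: dig into ColorfulBox"
}

console.log(runExperiment(ColorfulBox))

// A box that ignores its color prop would disprove the hypothesis.
const brokenBox = () => ({ backgroundColor: "gray" })
console.log(runExperiment(brokenBox))
```

In the real application the same experiment is a temporary one-line change: hardcode the color prop to red in Avatar and look at the screen. Either outcome teaches you something, which is what separates an experiment from a guess.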

Experiments help you to get into a structured way of working. You can use them to narrow down the problem space without digging through thousands of lines of code. I always tell developers that they don't need to understand all possible code paths when they try to fix a bug. They solely need to identify and understand the one path that is affected.

¹ To catch these kinds of bugs you probably want to use Flow or TypeScript. I've chosen this example nonetheless so that I did not have to create a more complex one.
² I sometimes joke about the Feynman Algorithm for problem-solving. Don't take me too seriously on that one.
³ It got pointed out to me that not all scientists work like that all day and that there are all sorts of tricks you can do to make your results fit a hypothesis. If you speak German, this video explains it nicely.
