Navigating and understanding large code bases

You’ve just landed your new dream job and you’re given your first issue to fix in a gigantic code base. Or maybe you’re digging into a library to understand what the hell is going on in your app. Sometimes it can feel a bit overwhelming. Where do you start?

I’m personally kind of addicted to understanding the libraries I use. In the past few months I’ve read a bunch of open source stuff out there but also a lot of code at Shopify, where we have a huge Rails code base. I like to think that I’m getting slightly faster at understanding code and I’ve developed some kind of system for how I approach a complicated piece of code. I thought I’d formalize it in this blog post. Let’s get to it!

Disclaimer: These are things that work well for me, but everybody learns in a different way. These tricks might be completely useless for you, but maybe they’ll help!

1. Always Be Writing

The most important thing for me is to write down what I find and learn. So many times I’ll go through a code base and think I understand it, only to come back to it two weeks later and have to learn things again because I forgot.

I personally use private gists for that. My latest example is probably from my dive into RPM, New Relic’s Ruby Client. My notes looked like this.

As you can see, they don’t have to be written down in a nice way. It’s really just for you, to remember and visualize things better.

2. Find an entry point

I like to think about a code base like some kind of directed graph. You can often pick a function call and follow it through edges until you can’t continue anymore. But which path should you explore? I like to pick an entry point into that graph that makes sense. Usually, if it’s a library, I would pick the main public API method and follow the path.

For example, when I was diving into Relay’s code base, I picked RelayRenderer as the main entry point and followed along to figure out how Relay was querying a GraphQL endpoint.
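To make the graph idea a bit more concrete, here’s a minimal sketch of what “pick an entry point and follow the edges” could look like. The call graph and every function name in it (render, buildQuery, sendRequest, and so on) are made up for illustration; they don’t come from Relay or any real library.

```javascript
// Hypothetical call graph of a small library: each key is a function and
// each value lists the functions it calls. None of these names are real.
const callGraph = {
  render: ["buildQuery", "logDebugInfo"],
  buildQuery: ["sendRequest"],
  sendRequest: ["parseResponse"],
  parseResponse: [],
  logDebugInfo: [],
};

// Follow one path from an entry point, always taking the first callee,
// until we reach a function that calls nothing else.
function followPath(graph, entryPoint) {
  const path = [entryPoint];
  const seen = new Set(path);
  let current = entryPoint;
  while ((graph[current] || []).length > 0) {
    const next = graph[current][0];
    if (seen.has(next)) break; // stop if the path loops back on itself
    path.push(next);
    seen.add(next);
    current = next;
  }
  return path;
}

console.log(followPath(callGraph, "render").join(" -> "));
// render -> buildQuery -> sendRequest -> parseResponse
```

Always taking the first callee is obviously too naive for real code. Choosing which edge to follow is the interesting part.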

3. Prune branches

Every time we hit a new graph “node”, which could be the body of a method, we usually end up expanding our choice of paths. Take this fake method as an example:

function getDataFromAPI(params, apiEndpoint) {
  if (params.something) {
    params = getNewParamsFromParams(params);
  }

  if (validateApiEndpoint(apiEndpoint)){
    queryAPI(params, apiEndpoint);
  } else {
    handleApiEndpointInvalid();
  }
}

There are quite a few things happening in this method. Here’s what our graph looks like for this region.

                       +------------------+
                       |                  |
                       |  getDataFromAPI  |
                       |                  |
                       +--+------+-----+--+
                          |      |     |
                          |      |     |
                          |      |     |
                          |      |     |
                          |      |     |
        +-----------------+      |     +----------------+
        |                        |                      |
        |                        |                      |
        v                        v                      v
getNewParamsFromParams  handleApiEndpointInvalid     queryAPI

We’ve got 4 choices to explore now.

  • A: We could try and inspect what getNewParamsFromParams does, and why it’s called when params.something is present.
  • B: We could check out what happens when the apiEndpoint is invalid.
  • C: We could see what happens in queryAPI.
  • D: Explore all the things!

Option D is rarely a good choice for me for a few reasons. First, we just added 3 new paths to explore. The number of nodes to understand just grew by a lot, and we have no way of knowing how deep we’ll have to search before we come back to getDataFromAPI. The second reason I don’t like option D is that not all code paths are important for us, especially on a first read.

In this case, option C makes the most sense. We have to remember what we were searching for in the first place: how is data fetched from that API? Error handling can be understood later, and edge cases don’t matter at this point. We’re looking for a general understanding of the code base.

Coming back to point #1, I always document the path I take by writing the file name I’m exploring, the class I’m in, the method called, and a summary of what is done in there.

To summarize this point: aggressively try to prune certain paths from your traversal. Avoid expanding in breadth; focus on depth first.

In reality, what we’re doing is pretty close to a Best-First Search, where the heuristic function is decided by your brain. You have to guess which branch will lead you to your goal of understanding a particular piece of code.
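Here’s a rough sketch of that idea applied to the getDataFromAPI example, using a greedy best-first search. Everything that isn’t in the fake method above is an assumption made for illustration: the callees of queryAPI (buildRequest, sendHttpRequest) are hypothetical, and the guessedDistance numbers simply stand in for your gut feeling about which branch is closest to the code that actually fetches data.

```javascript
// Call graph for the fake getDataFromAPI example. The callees of queryAPI
// (buildRequest, sendHttpRequest) are hypothetical, added for illustration.
const callGraph = {
  getDataFromAPI: ["getNewParamsFromParams", "handleApiEndpointInvalid", "queryAPI"],
  getNewParamsFromParams: [],
  handleApiEndpointInvalid: [],
  queryAPI: ["buildRequest", "sendHttpRequest"],
  buildRequest: [],
  sendHttpRequest: [],
};

// The "heuristic decided by your brain": a made-up guess (lower is better)
// of how far each function is from the code that actually fetches data.
const guessedDistance = {
  getDataFromAPI: 3,
  getNewParamsFromParams: 5,
  handleApiEndpointInvalid: 8,
  queryAPI: 1,
  buildRequest: 2,
  sendHttpRequest: 0,
};

// Greedy best-first search: always read the most promising function next,
// and stop as soon as we reach the one we were looking for.
function bestFirstSearch(graph, heuristic, start, goal) {
  const frontier = [start];
  const visited = new Set();
  const readingOrder = [];
  while (frontier.length > 0) {
    // A sorted array is enough here; a real priority queue would scale better.
    frontier.sort((a, b) => heuristic[a] - heuristic[b]);
    const current = frontier.shift();
    if (visited.has(current)) continue;
    visited.add(current);
    readingOrder.push(current);
    if (current === goal) break;
    for (const callee of graph[current] || []) {
      if (!visited.has(callee)) frontier.push(callee);
    }
  }
  return readingOrder;
}

const order = bestFirstSearch(callGraph, guessedDistance, "getDataFromAPI", "sendHttpRequest");
console.log(order.join(" -> "));
// getDataFromAPI -> queryAPI -> sendHttpRequest
```

With these guesses, getNewParamsFromParams and handleApiEndpointInvalid stay in the frontier and never get read. That is the pruning from this section: they aren’t forgotten, just postponed.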

4. Multiple passes

In point 3, I suggested you aggressively cut “useless” branches that won’t lead you to what you’re really interested in. In reality, I like to do multiple “passes” on a code base. On the first pass, I’ll cut a lot of corners and skip almost all branches except the main one.

On the subsequent passes, I’ll start going into some of those branches. Since we already have a general understanding of the code base, these branches start to make more sense to us.
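As a small sketch of how the passes could be planned for the getDataFromAPI example: the priority numbers below are my own made-up ranking (1 for the main path, 2 for the edge case, 3 for error handling), not something you can read off the code.

```javascript
// Branches of the fake getDataFromAPI example, tagged with a made-up
// priority: 1 = main path, 2 = edge case, 3 = error handling.
const branches = [
  { name: "queryAPI", priority: 1 },
  { name: "getNewParamsFromParams", priority: 2 },
  { name: "handleApiEndpointInvalid", priority: 3 },
];

// On pass N, read only the branches whose priority is N.
// Earlier passes already covered the more important ones.
function planPasses(allBranches, passCount) {
  const plan = [];
  for (let pass = 1; pass <= passCount; pass++) {
    const names = allBranches
      .filter((branch) => branch.priority === pass)
      .map((branch) => branch.name);
    plan.push(names);
  }
  return plan;
}

planPasses(branches, 3).forEach((names, index) => {
  console.log(`Pass ${index + 1}: ${names.join(", ")}`);
});
// Pass 1: queryAPI
// Pass 2: getNewParamsFromParams
// Pass 3: handleApiEndpointInvalid
```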

5. Don’t use if you don’t understand

I’ll admit that this point is a bit extreme and even I don’t always follow it 100%. But just trying to follow it can yield awesome results: try to never use a class, function, library, or API without knowing a bit of its internals.

In theory, the interface is enough to write great software. But in my experience, for personal development, trying to dive into everything you use helps you learn much faster.

Bonus: Time

There are no shortcuts to understanding complex code bases. These tricks help me make sense of them much faster, but in the end, you just need to set time aside to really understand what you work on or what you use.

Hope these tricks will help you! If not, please share what has worked for you! As always, you can find me on Twitter and GitHub!
