November 9, 2020

How to Recalculate a Spreadsheet

Let’s say I’m ordering burritos for my two friends while they quar up in Jersey City, and want to calculate the total price of my order:

screenshot of spreadsheet; burrito price is listed as $7, burrito price w ship as burrito price plus $3, num burritos is 2, and total is num burritos times burrito price w ship, for a total of $20

It’s a little confusing to follow the flow of data in a spreadsheet when it’s written like that, so I hope you don’t mind this equivalent diagram that represents it as a graph:

the previous spreadsheet represented as a graph, with arrows from one cell to another replacing the spreadsheet cell references

We’re rounding the cost of an El Farolito super vegi burrito to $8, so assuming the per-burrito delivery toll remains at just $2 per burrito, it looks like the total for our two burritos will be $20.

Oh no, I completely forgot! One of my friends loves to wolf down multiple burritos at a time, so I actually want to place an order for three burritos. If I update Num Burritos, a naïve spreadsheet engine might recompute the entire document, recalculating first the cells with no inputs, and then recalculating any cell whose inputs are ready until we’ve finished every cell. In this case, we’d first calculate Burrito Price and Num Burritos, then Burrito Price w Ship, and then a new final Total of $30.

same as previous graph, but num burritos is updated to 3, and every cell is tagged as "recalc"

This simple strategy of recalculating the whole document may sound wasteful, but it’s actually already better than VisiCalc, the first spreadsheet software ever made, and the first so-called “killer app”, responsible for popularizing the Apple II. VisiCalc would repeatedly recalculate cells from left-to-right and top-to-bottom, sweeping over them again and again until none of them changed. Despite this “interesting” algorithm, VisiCalc remained the dominant spreadsheet software for four years. Its reign ended in 1983, when Lotus 1-2-3 swept the market with “natural-order recalculation”, as described by Tracy Robnett Licklider in Byte Magazine:

Lotus 1-2-3 exploited natural-order recalculation, although it also supported VisiCalc’s row- and column-order modes. Natural-order recalculation maintained a cell dependency list and recalculated a cell before recalculating cells that depended on it.

Lotus 1-2-3 implemented the “recalculate everything” strategy we’ve shown above, and for the first decade of spreadsheets, that was as good as it got. Yes, we recalculate every cell in the document, but at least we only recalculate every cell once.

but what about “burrito price w ship”

Great point, header 2. In my three burrito example there’s no reason to recompute Burrito Price w Ship, because changing the number of burritos we order can’t possibly influence the per-burrito price. In 1989, one of Lotus’ competitors realized this, and created SuperCalc5, presumably naming it after the theory of super burritos at the core of this algorithm. SuperCalc5 recalculated “only cells dependent on changed cells”, which would make updating the burrito count look more like this:

same as prior graph, with burrito count updated from 2 to 3, but now only the two affected cells "num burritos" and "total" are tagged as recalc

By only updating a cell when one of its inputs changes, we can avoid recalculating Burrito Price w Ship. In this case, it saves just a single addition, but on larger spreadsheets it can save quite a bit of time! Unfortunately, we now have another problem. Let’s say my friends now want meat burritos, which cost a dollar more, and simultaneously El Farolito adds a $2 fee paid per-order, regardless of how many burritos you order. Before any formula outputs are recalculated, our graph might look like this:

same as prior graph (after burrito count update finished calculation), but now burrito price is being updated from $8 to $9, and simultaneously total is updated from "burrito price w ship * num burritos" to "burrito price w ship * num burritos + $2 fee"

Since there are two updated cells here, we have a problem. Should we recalculate Burrito Price first, or Total? Ideally, we first calculate Burrito Price, notice that its output has changed, then recalculate Burrito Price w Ship, and finally recalculate Total. However, if we instead recalculate Total first, we’ll have to recalculate it a second time once the new $9 burrito price propagates down. If we don’t calculate cells in the right order, this algorithm isn’t better than recalculating the whole document. In some cases, it’s as slow as VisiCalc!

Clearly, it’s important for us to figure out the right order to update our cells. Broadly, there are two solutions to this problem: dirty marking and topological sorting.

This first solution involves marking all cells downstream from an edit as dirty. For instance, when we update Burrito Price, we would mark the downstream cells Burrito Price w Ship and Total as dirty, even before doing any recalculations:

same as prior graph with the two updates, but now three nodes are tagged as dirty: "burrito price", "burrito price w ship", and "total". would also like to apologize for the rather confusing image alt text so far; it's really hard to write these for graph diagrams!! if you are a screen reader user and have advice on better ways to do this, would love to hear from you.

Then, in a loop, we find a dirty cell that has no dirty inputs, and recalculate it. When there are no dirty cells left, we’re done! This solves our ordering problem. There’s one downside though — if a cell is recalculated and we find its new output to be the same as its previous output, we’ll still recalculate downstream cells! A little bit of extra logic can avoid actually running the formula trouble in this case, but we unfortunately still waste time marking and unmarking a lot of cells as dirty.

The second solution is topological sorting. If a cell has no inputs, we mark its height as 0. If a cell has inputs, we mark its height as the maximum of the heights of its inputs, plus one. This guarantees all cells have a greater height than any of their inputs, so we just keep track of all cells with a changed input, always choosing the cell with the lowest height to recalculate first:

same as prior graph with the two updates, but instead of dirty tags, now every node has a height tag. "burrito price" and "num burritos", the two cells with no in-nodes, have height 0. "burrito price w ship" has height 1. "total" has height 2.

In our double-update example, Burrito Price and Total would be initially added to the recalculation heap. Burrito Price has lesser height, and would be recalculated first. Since its output changes, we then would add Burrito Price w Ship to the recalculation heap, and since it too has less height than Total, it would be recalculated before we finally recalculate Total.

This has a big advantage over the first solution: no cell is ever marked dirty unless one of its inputs actually change. However, it requires we keep all cells pending recalculation in sorted order. If we use a heap, this results in an O(n log n) slowdown, so in the worst case, asymptotically slower than Lotus 1-2-3’s strategy of recalculating everything.

Modern-day Excel uses a combination of dirty marking and topological sorting, which you can read more about in their docs.

demand-driven complications

We’ve now more or less reached the algorithms used in modern-day spreadsheet recalculation. Unfortunately, I suspect there is basically no business case to be made for ever improving it further. The few people with the problem “my Excel spreadsheet is too slow” have already written enough Excel formulas that migration to any other platform is impossible. Fortunately, I have no understanding of business, and so we’re going to look at further improvements anyway.

Beyond caching, one of the cool aspects of a spreadsheet-style computation graph is we can only calculate the cells that we’re interested in. This is sometimes called lazy computation, or demand-driven computation. As a more concrete example, here’s a slightly expanded burrito spreadsheet graph. This example is the same as before, but we’ve added what is best described as “salsa calculations”. Each burrito contains 40 grams of salsa, and we perform a quick multiplication to know how much salsa is in our entire order. In this case, since our order has three burritos, there’s a total of 120 grams of salsa in our entire order.

a new graph. similar structure to the old graph, but there are two new nodes: "salsa per burrito", which is set to the constant "40 grams", and "salsa in order", which is "salsa per burrito" times "num burritos"

Of course, astute readers will have spotted the problem here already: knowing the total weight of salsa in an order is a pretty useless measurement. Who cares that it’s 120 grams? What am I supposed to do with this information?? Unfortunately, a regular spreadsheet would waste cycles calculating Salsa In Order, even if we don’t want it recalculated most of the time.

This is where demand-driven recalculation can help. If we could somehow specify that we’re only interested in the output of Total, we could only recompute that cell and its dependencies, and skip touching Salsa In Order and Salsa Per Burrito. Let’s call Total an observed cell, since we’re trying to look at its output. We can also call both Total and its three dependencies necessary cells, since they’re necessary to compute some observed cell. Salsa In Order and Salsa Per Burrito would be aptly described as unnecessary.

Some folks on the Rust team created the Salsa framework to solve this problem, clearly naming it after the unnecessary salsa calculations their computers were wasting cycles on. Salsa is really cool, and I’m sure they can explain how it works better than I can. Very roughly, they use revision numbers to track whether a cell needs recalculation. Any mutation to a formula or input increments the global revision number, and every cell tracks two revisions: verified_at to track the revision its output was last brought up-to-date, and changed_at to track the revision its output last actually changed.

our new graph, but now there's a title of "current revision: R6". each cell is tagged with a change revision and verified at revision. all change revisions are R1, except "salsa per burrito", which is R6. all verified at revisions are R6, except "salsa in order", which is R1.

When the user indicates they’d like a fresh value for Total, we’d first recursively recalculate any cell necessary to Total, skipping cells if their last_updated revision is equal to the global revision. Once the dependencies of Total are up-to-date, we only rerun the actual formula in Total if either Burrito Price w Ship or Num Burrito have a changed_at revision greater than the verified_at revision of Total. This is great for Salsa’s purposes in the rust-analyzer, where simplicity is important and each cell takes a significant amount of time to compute. However, we can see the disadvantages in our burrito graph above — if Salsa Per Burrito constantly changes, our global revision number will frequently tick up. This will make each observation of Total walk the three cells necessary to it, even though none of those cells have actually changed. No formulas will be recalculated, but if the graph is large, repeatedly walking all of a cell’s dependencies could get expensive.

faster demand-driven solutions

Instead of inventing new algorithms for demand-driven spreadsheets, what if we instead draw from the two classical spreadsheet algorithms mentioned earlier: dirty marking and topological sorting? As you might imagine, a demand-driven model complicates both of these, but both are still viable.

Let’s first look at dirty marking. As before, when we change a cell’s formula, we mark all downstream cells as dirty. So if we update Salsa Per Burrito, it would look something like this:

same salsa graph, but no revision tags. instead, "salsa per burrito" is updating from 40 grams to 45 grams, and "salsa per burrito" and "salsa in order" are both tagged as "dirty"

However, instead of immediately recomputing all cells that are dirty, we wait for the user to observe a cell. Then we run Salsa’s algorithm on the observed cell, but instead of reverifying dependencies with outdated verified_at revision numbers, we instead only reverify cells marked as dirty. This is the technique used by Adapton. You’ll note that when we observe Total, we’ll find it isn’t dirty, and so we can skip the graph walk that Salsa would have performed!

If we instead decide to observe Salsa In Order, we’ll find it is marked as dirty, and so we’ll reverify and recompute both Salsa Per Burrito and Salsa In Order. Even here, there are benefits over using just revision numbers, since we’ll be able to skip a recursive walk on the still-clean Num Burritos cell.

Demand-driven dirty marking performs fantastically when the set of cells we’re trying to observe changes frequently. Unfortunately, this has the same downsides as the dirty marking algorithm from before. If a cell with many downstream cells changes, we may waste a lot of time marking and unmarking cells as dirty, even if their inputs won’t actually have changed when we go to recalculate them. In the worst case, each changes causes us to mark the entire graph as dirty, which would give us the same ~order of magnitude performance as Salsa’s algorithm.

Moving on from dirty marking, we can also adapt topological sorting for demand-driven computations. This is the technique used by Jane Street’s Incremental library, and requires major tricks to get right. Before we added demand-driven computations, our topological sorting algorithm used a heap to determine which cell would be recomputed next. But now, we only want to recompute cells that are necessary. How? We don’t want to walk the entire tree from our observed cells like Adapton, since a complete tree walk defeats the entire purpose of topological sorting, and would give us performance characteristics similar to Adapton.

Instead, Incremental maintains a set of cells that the user has marked observed, as well as the set of cells that are necessary to any observed cell. Whenever a cell is marked observed or unobserved, Incremental walks that cell’s dependencies to make sure the necessary marks are applied correctly. Then, we only add cells to the recalculation heap if they’re marked necessary. In our burrito graph, if only Total is part of the observed set, changing Salsa in Order would not result in any walking of the graph, since only necessary cells are recomputed:

same graph as before; "salsa per burrito" is being updated from 40 to 45 grams, and is tagged as "dirty". "total" and its dependencies "burrito price", "num burritos", and "burrito price w ship" are tagged as "necessary". "total" is also tagged as "observed".

This solves our problem without eagerly walking the graph to mark cells as dirty! We still have to remember that Salsa per Burrito is dirty, since if it later becomes necessary, we will need to recompute it. But unlike Adapton’s algorithm, we don’t need to push this single dirty mark down the entire graph.

anchors, a hybrid solution

Both Adapton and Incremental walk the graph, even when not recomputing cells. Incremental walks the graph upstream when the set of observed cells changes, and Adapton walks the graph downstream when a formula changes. With these small graphs, it may not be immediately apparent that this graph walking is expensive. However, if your graph is large and your cells are relatively cheap to compute — often the case with spreadsheets — you’ll find that most of your costs come from wasted graph walking! When cells are cheap, marking a bit on a cell can cost roughly the same as just recomputing the cell from scratch. So ideally, if we want our algorithm to be substantially faster than computing from scratch, we’ve got to avoid unnecessarily walking the graph as best we can.

The more I thought about the problem, the more I realized that they waste time walking the graph in roughly opposite situations. In our burrito graph, let’s imagine cell formulas rarely change, but we rapidly switch between first observing Total, and then observing Salsa in Order.

same graph as before, but arrows pointing to "total" and "salsa in order" indicate we're rapidly switching which of the two we are observing

In this case, Adapton will never walk the tree. No inputs are changing, and so we never need to mark anything as dirty. Since nothing is dirty, each observation is cheap as well, since we can simply return the cached value from a clean cell immediately. However, Incremental performs poorly in this example. Even though no values are ever recomputed, Incremental will repeatedly mark and unmark many cells as necessary and unnecessary.

In the opposite case, let’s imagine a graph where our formulas are rapidly changing, but we don’t change which cells we’re observing. For instance, we could imagine we’re observing Total while rapidly changing Burrito Price from 4+4 to 2*4.

same graph as before, arrow pointing to "burrito price" indicates its value is rapidly switching.

Just like the previous example, we aren’t really recomputing many cells. 4+4 and 2*4 both equal 8, so ideally we only recompute that one bit of arithmetic when the user makes this change. However, unlike the previous example, Incremental is now the library that avoids tree walking. With Incremental, we’ve cached the fact that Burrito Price is a necessary cell, and so when it changes, we can recalculate it without walking the graph. With Adapton, we waste time marking Burrito Price w Ship and Total as dirty, even though the output of Burrito Price won’t have changed.

Given each algorithm performs well in the other’s degenerate cases, wouldn’t it be ideal if we could just detect those degenerate cases, and switch to the faster algorithm? This is what I have attempted to do with my own library, Anchors. Anchors runs both algorithms simultaneously on the same graph! If this sounds wild and unnecessary and overly complicated, that’s probably because it is.

In many cases, Anchors follows Incremental’s algorithm exactly, which avoids Adapton’s degenerate case above. But when cells are marked as unobserved, its behavior diverges slightly. Let’s look at what happens. We start with marking Total as observed:

same as "rapidly switching which is observed" graph but now total and its dependencies are marked as "necessary". total is marked as "observed".

If we then mark Total as unobserved and Salsa in Order as observed, the traditional Incremental algorithm would alter the graph to look like this, walking through every cell in the process:

same graph as above, but now "salsa in order" and its dependencies are marked as "necessary". "salsa in order" is also marked as observed.

Anchors also walks every cell for this change, but instead produces a graph that looks like this:

same graph as above with "salsa in order" observed, but now additionally "total", "burrito price w ship", and "burrito price" are all marked as "clean".

Note the “clean” flags! When a cell is no longer necessary, we mark it as “clean”. Let’s look at what happens when we switch from observing Salsa in Order to Total:

All nodes in the graph are now tagged as "clean". also, "total" is marked as "observed"

That’s right — our graph now has no necessary cells. If a cell has a “clean” flag, we never mark it as observed. At this point, no matter what cell we mark as observed or unobserved, Anchors will never waste time walking the graph — it knows that none of the inputs have changed.

Looks like there’s a discount at El Farolito! Let’s drop Burrito Price by a dollar. How does Anchors know Total needs to be recomputed? Before we recompute any formulas, let’s look at how Anchors will see the graph right after we change just Burrito Price, before recomputing anything:

same graph as above, but now "total", "burrito price w ship", and "burrito price" are tagged as "dirty" instead of "clean".

If a cell has a clean flag, we run the traditional Adapton algorithm on it, eagerly marking downstream cells as dirty. When we later run the Incremental algorithm, we can quickly tell that there is an observed cell marked as dirty, and know we need to recompute its dependencies. After that’s finished, the final graph will look like this:

same graph as above, but now "total", "burrito price w ship", and "burrito price" are tagged as "necessary" instead of "dirty".

We only recompute cells if they’re necessary, so whenever we recompute a dirty cell, we also mark it as necessary. At a high level, you can imagine these three cell states form a looping state machine:

three states "necessary", "clean" and "dirty", with an arrow from "necessary" to "clean", "clean" to "dirty", and "dirty" to "necessary".

On necessary cells, we run the Incremental’s topological sorting algorithm. On unnecessary cells, we run the Adapton algorithm.

burrito syntax

I’d like to finish up by answering a final question: so far, we’re been discussing the many problems the demand-driven model causes for our recomputation strategy. But the problems aren’t just for the algorithm: there are syntactic problems to solve too. For instance, let’s make a spreadsheet to select a burrito emoji for our customer. We’d write an IF statement in the spreadsheet like this:

a new spreadsheet. vegetarian customer is set to YES, vegi burrito is an emoji of a broccoli, meat burrito is an emoji of a meat, and burrito for customer is "=IF(B1=YES, B2, B3)"

Because traditional spreadsheets aren’t demand-driven, we can compute B1, B2, and B3 in any order, so long as we compute our IF cell after all three inputs are ready. In a demand-driven world however, if we can calculate the value of B1 first, we can peek at the value to know which of B2 or B3 we need to recalculate. Unfortunately, a traditional spreadsheet’s IF has no way to express this!

The problem: B2 simultaneously references cell B2 and retrieves its value, 🥦. Most demand-driven libraries mentioned in this post instead explicitly distinguish between a reference to a cell and the act of retrieving its actual value. In Anchors, we call this cell reference an Anchor. Much like a burrito in real life wraps a bunch of ingredients together, the Anchor<T> type wraps T — which I suppose makes our vegi burrito cell Anchor<Burrito>, a sort of ridiculous burrito of burritos. Functions can pass around our Anchor<Burrito> as much as they’d like — it’s only when they actually unwrap the burrito to access the Burrito inside that we create a dependency edge in our graph, indicating to the recomputation algorithm that the cell may be necessary and need recomputing.

The approach taken by Salsa and Adapton is to use function calls and normal control flow as a way to unwrap these values. For instance, in Adapton, we might write our “Burrito for Customer” cell something like this:

let burrito_for_customer = cell!({
    if get!(is_vegetarian) {
        get!(vegi_burrito)
    } else {
        get!(meat_burrito)
    }
});

By distinguishing between a cell reference (vegi_burrito here) and the act of unwrapping its value (get!), Adapton can piggy-back on top of Rust’s control flow operators like if. This is a great solution! However, a little bit of magical global state is needed to correctly connect the get! calls to the cell! that needs recomputing when is_vegetarian changes. The approach I’ve taken with Anchors, inspired by Incremental, is a slightly less magical system. Similar to a pre-async/await Future, Anchors allows you to use operations like map and then on an Anchor<T>. For instance, the example above would look something like this:

let burrito_for_customer: Anchor<Burrito> = is_vegetarian.then(|is_vegetarian: bool| -> Anchor<Burrito> {
  if is_vegetarian {
      vegi_burrito
  } else {
      meat_burrito
  }
});

further reading

In this already long and winding blog post, there is so much I didn’t have space to talk about. Hopefully these resources can shed a little bit more light on this very interesting problem.

And of course, you can always check out the framework I built, Anchors.


Thanks to Syd Botz, Adam Perry, and Peter Stefek for feedback and discussion!


Lord i/o is a blog about tools, text, and text editing. Subscribe to new posts via Twitter, Atom feed, or email: