Over-the-top optimisations with Nim

Published on Dec 9, 2018

For the past few years I’ve been taking part in Eric Wastl’s Advent of Code, a coding challenge that provides a 2-part problem each day from the 1st of December through to Christmas Day. The puzzles are always interesting — especially as they get progressively harder — and there’s an awesome community of folks that share their solutions in a huge variety of languages.

To up the ante somewhat, Shane and I usually have a little informal competition to see who can write the most performant code. This year, though, Shane went massively overboard and wrote an entire benchmarking suite and webapp to measure our performance, which I took as an invitation and personal challenge to try to beat him every single day.

For the past three years I’d used Python exclusively, as its vast standard library and awesome syntax lead to quick and elegant solutions. Unfortunately it stands no chance, at least on the earlier puzzles, of beating the speed of Shane’s preferred language of PHP. For a while I consoled myself with the notion that once the challenges get more complicated I’d be in with a shot, but after the third or fourth time that Shane’s solution finished before the Python interpreter even started^[1] I decided I’d have to jump ship. I started using Nim.

Introducing Nim

Nim, formerly Nimrod, is a compiled language that takes a lot of cues from Python. It has a very nice and familiar syntax, a reasonable standard library, and it’s fast. I’d thought about learning it before but didn’t really have anything suitable to use it on, until now. The code I used for my day one part one answer looks like this in Nim:

import math, sequtils, strutils

echo readFile("data/01.txt").strip.splitLines.map(parseInt).sum

It’s a one liner that Python would be proud of. The difference with Nim, though, is that this compiles down to C, and from there you get all the benefits of an optimising C compiler and linker. You end up with a blazingly fast stand-alone binary.

Losing my marbles

Day 9 of this year’s Advent of Code proved interesting to optimise, and I’m going to walk through some of the steps I took and their impact. I’m in no way a Nim expert and this is for a program that will be ran once and then thrown away, so please don’t take this too much to heart.

Day 9 presents a marble game played by Santa’s elves, whereby marbles with increasing values are added to a circle according to certain rules; every 23rd marble is special and the elf playing it gets to keep that one and also pick up a marble a certain number of places away. The winner is the one with the highest marble value at the end. It doesn’t sound like a particularly thrilling game, but as far as I can tell there’s no way to easily predict the winner without simulating it step-by-step so it makes for an interesting problem.

Naive solution: over 10 minutes

My puzzle input called for a game with 72,104 marbles. My initial approach was to use a sequence (similar to a list) to store the values of the marbles as they’re added to the circle. This got an answer for part 1 in a about 10 seconds and put me at number 124 on the global leaderboard for fastest completion. Unfortunately, when part 2 was revealed it asked me to calculate the result if there were 7,210,400 marbles in play.

Obviously a puzzle 100x larger would take at least 100x longer to run, and almost certainly a lot more than that. There isn’t a way to calculate the advance stages more quickly, so the only thing to be done is to make it run a lot faster. Seven million iterations isn’t really that much of a burden for a modern CPU: for the code to be running this slowly the execution time of some of the operations must be scaling with the number of marbles. A quick look through the documentation reveals:

proc del[T](x: var seq[T]; i: Natural) {...}
deletes the item at index i by putting x[high(x)] into position i.
This is an O(1) operation.

proc delete[T](x: var seq[T]; i: Natural) {...}
deletes the item at index i by moving x[i+1..] by one position.
This is an O(n) operation.

Because we have to delete a marble at an arbitrary point and maintain the ordering of the others, I was using the delete() proc which has an O(n) runtime. The other potentially costly operation is inserting a new marble; the documentation doesn’t mention the runtime but all of the nim docs have a direct link to the source code, and we can see that inserting an element requires iterating over all the elements after it, so it’s also O(n) in the worse case.

DoublyLinkedLists: ~500ms

When you need performant inserts and deletes in a list, the go-to solution is a linked list. Because nodes store references to their neighbours (instead of being stored consecutively in an array or list), delete and insert operations are O(1): you simply need to change a few pointers. Nim’s lists package provides a convenient DoublyLinkedList that I went ahead and used.

Instead of using the old insert and delete methods I now had my own which simply manipulate the nodes’ previous and next pointers:

func insertAfter(node: DoublyLinkedNode[int], value: int) =
    var newNode = newDoublyLinkedNode(value)
    newNode.next = node.next
    newNode.prev = node
    newNode.next.prev = newNode
    newNode.prev.next = newNode

 func remove(node: DoublyLinkedNode[int]) =
    node.prev.next = node.next
    node.next.prev = node.prev

This implementation brought the runtime down to a respectable 500ms, which handily beat Shane’s PHP implementation. It was still an order of magnitude longer than any of my other solutions, though, so I wasn’t happy yet.

Reduced imports: ~470ms

One thing I was conscious of from trying to make Python performant was how the number of imports can pile on to startup time. I had a couple of unused imports that were easy to shed, and I also decided to implement my own linked list in favour of nim’s lists module. All this involved was defining a type and then replacing my usages of DoublyLinkedNode[int] with my new Marble.

type
    Marble = ref object
        next, prev: Marble
        value: int

These few changes didn’t have a huge impact, but I was clutching at straws and every 30ms was a small victory.

Inlining methods and small optimisations: ~420ms

Thinking the code was about as fast as I was going to get it, I made a final pass to see if there were any little tweaks I could make. First off, I added the inline pragma to my insert and remove methods, to hint to the C compiler that they should be inlined. I was concerned that the overhead of calling a function seven million times would add up, and inlining the fairly simple operation seems reasonable. It’s entirely possible the C compiler was already doing this (they’re pretty clever), but making the hint explicit in Nim is really easy so there’s nothing to lose:

func insertAfter(node: Marble, value: int) {.inline.} =
    var newNode = new(Marble)
    newNode.value = value
    newNode.next = node.next
    newNode.prev = node
    newNode.next.prev = newNode
    newNode.prev.next = newNode

func remove(node: Marble) {.inline.} =
    node.prev.next = node.next
    node.next.prev = node.prev

I also made some small algorithmic tweaks. These are usually the bread and butter of optimisations but for this problem there were only a couple I could see:

We only care about the current player every 23rd marble, so instead of tracking the player each turn we can just calculate a 23 player jump when needed
Instead of testing whether the current marble is divisible by 23, which is potentially non-trivial for large numbers, we can use a separate variable that just counts down from 23 and gets reset
Instead of calculating the boundary condition (100 * marbles) whenever it’s used, we can put this in a variable and calculate it once up-front. (The C compiler probably handled this for us anyway)

These combination of tweaks saved another 50ms, and it seemed like there wasn’t a whole lot left that could possibly change.

Non-reference counted objects: ~180ms

While I was pondering further improvements, Shane mentioned that he managed to make PHP’s garbage collector segfault with his solution. That got me thinking: what would happen if Nim didn’t have to worry about garbage collecting our marbles? We have a fixed amount of them and don’t need to worry about memory leaks as the program runs for half a second and then quits. Changing the Marble type and manually allocating memory for it — something that is virtually impossible in languages like PHP or Python — was trivial in Nim:

type
    Marble = object
        next, prev: ptr Marble
        value: int32

proc insertAfter(node: ptr Marble, value: int) {.inline.} =
    var newNode = cast[ptr Marble](alloc0(sizeof(Marble)))

Taking the garbage collector out of the equation over doubled the performance! Still, it was my only solution that took more than 100ms and that bothered me…

No looking back: ~120ms

Thinking about memory allocations made me take a hard look at the structure of the Marble type. Each of the seven million marbles has a previous pointer that we only use to backtrack by a fixed amount every 23rd play, which seems wasteful. If we reduce the amount of memory we have to allocate, we’ll logically reduce the time taken allocating it.

As the game is simulated we keep track of the “current” marble, so why not keep track of the marble eight behind that? That would allow us to turn the doubly-linked list into a singly-linked list and save a whole bunch of memory. This ends up being slightly complicated as initially there aren’t eight marbles, and every 23rd play we jump the current position backwards (and without previous pointers, we can’t jump the “current minus eight” pointer backwards).

To work around these issues, I added a “trailing” pointer that gradually drifts backwards to eight behind the current pointer as moves are played. There are 22 normal moves that each advance the current pointer by two, so there’s plenty of time for this to happen.

var
    currentTrail = current
    currentTrailDrift = 0

# When a standard move occurs:
current.next.insertAfter(i)
current = current.next.next
if currentTrailDrift == 8:
    # Keep the trail eight marbles behind the current one
    currentTrail = currentTrail.next.next
else:
    # Don't move the trail so it drifts away by two marbles
    currentTrailDrift += 2

This is one of those optimisations that makes the code a bit harder to follow, but it sliced a third of the runtime off and takes us tantalisingly close to that 100ms threshold.

One bulk order of memory, please: ~50ms

Thinking about memory allocations, I realised we were doing seven million small allocations over the lifetime of the program. We know upfront how many marbles there are going to be and will need to allocate memory for them all at some point, so why not just do it in one big bang?

Fortunately, again, Nim lets you dive from the high-level Python-like world down to the nitty-gritty of memory management and pointers without blinking. Now after reading the puzzle input, I allocate a big chunk of memory (for my input with seven million marbles this equates to around 86MB of RAM) and keep a pointer to it:

let
    hundredMarbles = marbles * 100
    memory = alloc(MarbleSize * hundredMarbles)

Then when it comes to creating a “new” Marble, we simply calculate the position in our memory block and use it as a pointer:

proc addressOf(memory: pointer, marbleNumber: int): Marble {.inline.} =
    cast[Marble](cast[uint](memory) + cast[uint](marbleNumber * MarbleSize))

proc insertAfter(node: Marble, memory: pointer, value: int): Marble {.inline.} =
    var newNode = memory.addressOf(value)

Changing to this one-time allocation more than halved the runtime of the program, placing it firmly under the 100ms target I was aiming at. It’s particularly pleasing how little effort was required for optimisations like this, and how you can switch from high-level Python-style code to low-level C-style pointer manipulation.

You can find the full code to my solution in my aoc-2018 repository. If you’re not taking part in Advent of Code I highly recommend it, and if you’ve not used Nim it’s definitely worth a look.

PHP has always been fast to start, due to its primary use in a CGI environment, and the last few major versions of PHP have made its unbelievably blazingly fast as well, while Python unfortunately has issues with startup time ↩︎