a new stack

deception

In the previous episode of alloca shenanigans (which is not required reading for this post), I said this.

The stack is like a stack of plates. You can't change things in the middle or the whole thing falls down

I lied! Well... sort of. You see... the stack is just a bunch of bits, which means we don't have to treat the stack like a stack at all! In this second episode of alloca shenanigans we'll be implementing a custom data-structure in stack memory, which will allow us to resize things in the middle of the stack. I'll end it in a grand crescendo where we implement a resizing hashmap in stack memory with a custom programming language!

You can see the project on GitHub.

the structure

We will be using a list-like data-structure for our custom stack. Each element will contain data followed by the size of that data. For example, a list with two elements will look like this:

    <DATA> <DATA> <DATA> < 3  > <DATA> < 1  >
                                            |
                                           rsp

The stack pointer will point to the last element in the list. The size section of each data element is named "the size tag" and it is stored as an 8-byte u64.

implementation

We'll start off implementing our stack in assembly. We need to write some primitives for stack allocation: alloc, pop and resize.

Let's begin with alloc because it's the simplest.

Because of our unique constraints, we aren't allowed to use call, push, pop or any other instruction that implicitly modifies the stack pointer. Instead we'll use jmp instructions and store the return address in a register.

Here is the assembly code:

    alloc:
        ; allocate size bytes
        ; set the last 8 bytes to size
        sub rsp, rbx
        mov qword [rsp], rbx

        ; jump back to the caller
        jmp rax
        

The return address is given in rax and the amount (in bytes) to allocate is given in rbx. Note the mov qword and how the last 8 bytes of the allocation are used for the size. The amount to allocate in rbx has to include an extra 8 bytes for the size tag.

And here is the implementation for pop.

    fn pop() {
        // the stack grows downwards, so popping moves rsp back up
        rsp += *rsp;
    }
        
    pop:
        add rsp, [rsp]
        jmp rax
        

Now, implementing resize is going to be a little more complicated. For the sake of brevity I'll just show the pseudo-code, but you can imagine how this would translate to assembly.

    fn resize(ptr: *const void, amount: i64) {
        // the stack moves downwards
        // we start at i=rsp and move up to ptr
        for (let i = rsp; i < ptr; i += 8) {
            *(i - amount) = *i;
        }

        // update the size and rsp
        *(ptr - amount) = *ptr + amount;
        rsp -= amount;
    }
        

We'll start at rsp and move upwards (towards higher addresses) until we reach ptr, shifting all the data we meet along the way down by the given amount.

    <DATA> <DATA> <SIZE=2> <DATA> <SIZE=1>
                         |               |
                        ptr             rsp                                                      
        
    <DATA> <DATA> <SIZE=2>          <DATA> <SIZE=1>
                         |               |
                        ptr             rsp                                                      
        

Then we set the new size and update rsp.

    <DATA> <DATA> <DATA> <SIZE=3> <DATA> <SIZE=1>
                                |               |
                           (ptr - amount)      rsp                                                      
        

Nice! Dynamic stack memory! But there is still one more issue: pointers. If we resize memory before a pointer, then everything gets pushed around and our pointer no longer points to the right object. Oh No!

    <DATA> <SIZE=1> <DATA> <SIZE=1>
                                  |
                            ptr (2nd item)
        
    <DATA> <DATA> <DATA> <SIZE=3> <DATA> <SIZE=1>
                                  |
                            ptr (invalidated)
        

The solution? Instead of storing the address of our data, we can store the index. Then we can convert that index into an address when we need to use it. For example, instead of 0x123456789abcdef we store 0 for the first item in the stack.

We can convert an index into an address by calculating how many steps backwards our data is from the end of the stack with length - index - 1. For example, if our stack has 5 elements and we want to find the data at index 2, we would need to jump 5-2-1 = 2 steps backwards from the end of the stack. Here is the pseudo-code for converting an index into an address.

    fn idx2addr(idx: u64) -> u64 {
        let ptr = rsp;

        for (let i = 0; i < length-idx-1; i++) {
            ptr += *ptr; // hop over one size tag (the stack grows downwards)
        }

        return ptr;
    }
        

Note that this means we have to keep track of the length of our list in a register somewhere. Since we aren't using rbp for anything else, I've designated rbp as our length register. We'll update alloc and pop to increment/decrement rbp. For deeply nested data-structures like &arr[0][1][2], we'll have to store the reference as a list of indices. This means that references to deeply nested data-structures can get quite long, but that's a sacrifice I'm willing to make.

You can download an ASM code example here, but you may need an exotic debugger to be able to visualise how stack memory changes.

a language

We have an issue. Writing assembly is painful, but no language supports our funky stack, they all expect the stack to behave in a certain way. This is where the custom programming language comes in. It doesn't need to be fancy, just a basic wrapper around assembly. Like C but for stack shenanigans.

I'll be glossing over the whole "making a language" part in this blog post. I highly suggest you try following Make a Lisp and LLVM Kaleidoscope if you're feeling up for a challenge. They'll teach you a lot about writing your own languages. You can see my own implementation of "Make a Lisp" here.

For simplicity, the language will compile to NASM assembly. First we have a tokenizer. It'll convert a stream of text like:

    u32 inc(u32 a) {
        return a+1;
    }
        

Into a list of tokens, like:

    [
        Symbol("u32"), Symbol("inc"),
        Operand("("), Symbol("u32"), Symbol("a"), Operand(")"),
        Operand("{"),
        Return, Symbol("a"), Operand("+"), Number(1), Operand(";"),
        Operand("}"),
    ]
        

I decided to give the language C-like syntax because C is familiar and easy to parse. Now that we have a stream of tokens, the next step will be a little bit easier.

We need to convert our stream of tokens into an abstract syntax tree (AST). For example, the abstract syntax tree of 2+3*4 would look something like:

         Sym(+)
         /   \
        /     \
       /       \
    Num(2)    Sym(*)
              /   \
             /     \
            /       \
        Num(3)      Num(4)
        

Parsing simple expressions like set x = 3; into an AST is pretty easy, but for more complicated expressions like 2+3*4 you have to start worrying about operator precedence. There's a wonderful section in LLVM Kaleidoscope and the first part of this YouTube video that both show examples of a binary expression parser.

If you're a beginner and want to try implementing a parser yourself, I highly suggest using Lisp-like syntax since it is much easier to parse. There is a section in Make a Lisp that might be helpful for you.

In summary, a binary expression parser that uses operator-precedence parsing will take the following steps. First, it parses a "primary" expression, which we will call lhs for "left-hand-side". A primary expression is any expression without an operand. That means numbers, booleans, or strings, but also any expression wrapped in parentheses.

Then we can write the function signature for our binary operator parser. The minimum parameter will be useful later.

    fn parse_binop(minimum: usize, lhs: Expression);
        

And we can pass in the lhs we parsed earlier.

    parse_binop(0, lhs);
        

Inside the body of parse_binop, we peek at the right-hand-side for an operand (like +, *, etc.).

     lhs   operand
      |      |
      2      +
        

If the operand has lower precedence than the minimum, we just return lhs and allow some other function higher in the call stack to parse the rest of it.

    fn parse_binop(minimum, lhs) {
        if peek_operand().precedence < minimum {
            return lhs;
        }

        ...
        

If the operand has equal or higher precedence than the minimum, then we parse the operand and the next primary expression, and peek at the operand after that.

    let prev = parse_operand(); 
    let rhs = parse_primary();
    let next = peek_operand();
        

Our example now looks like this:

      prev  next
       |     |
    2 [+] 3 [*]
    |     |
   lhs   rhs
        

If the next operand has higher precedence than the previous one, then recursively call the parser with the "minimum" set to the previous operand's precedence.

    if prev.precedence < next.precedence {
        rhs = parse_binop(prev.precedence, rhs);
    }
        

Then we join the left-hand-side and right-hand-side.

    lhs = Ast(lhs, prev, rhs);
        

And recursively call again with the same "minimum" value.

    return parse_binop(minimum, lhs);
        

Here is the full pseudo-code:

    fn parse_expression() {
        return parse_binop(0, parse_primary());
    }

    fn parse_binop(minimum, lhs) {
        if peek_operand().precedence < minimum {
            return lhs;
        }

        let prev = parse_operand(); 
        let rhs = parse_primary();
        let next = peek_operand();

        if prev.precedence < next.precedence {
            rhs = parse_binop(prev.precedence, rhs);
        }

        lhs = Ast(lhs, prev, rhs);

        return parse_binop(minimum, lhs);
    }
        

Now we have an AST that can represent function definitions, variable assignments, expressions, and control flow operations (if, while, for). Next we need a code-generator: it'll take in our AST and output NASM assembly, which we can compile into a binary by calling the nasm command line tool directly.

The code-generator was the hardest part to write. It is painful to debug thousand-line assembly files where no debugger works because your stack is too screwed, which is why I wrote my own (wink wink nudge nudge).

Let's rewrite some of our assembly functions in our code-generator. For our alloc function, we'll make sure to keep track of the stack length and return the index of the allocated data in the stack.

    struct Codegen {
        stack_len: usize,
    }
    
    fn alloc(codegen: &Codegen, size: u64) -> usize {
        let idx = codegen.stack_len;
        
        // allocate size bytes
        // set the last 8 bytes to size
        write_asm!("sub rsp, {size}");
        write_asm!("mov qword [rsp], {size}");
        codegen.stack_len += 1;

        return idx;
    }
        

And the pop function follows in the same vein.

    fn pop(codegen: &Codegen) {
        write_asm!("add rsp, [rsp]");
        codegen.stack_len -= 1;
    }
        

Then we can implement our idx2addr function. It will take in an index and return the address in a register.

    fn idx2addr(codegen: &Codegen, idx: usize, reg: &str) {
        write_asm!("mov {reg}, rsp");

        for (let i = 0; i < codegen.stack_len-idx-1; i++) {
            write_asm!("add {reg}, [{reg}]");
        }
    }
        

Now we can start with code-generating a simple number. We'll need to allocate some space on the stack and then put that number in it.

    fn codegen_number(codegen: &Codegen, number: Number) -> usize {
        let idx = codegen.alloc(16);
        codegen.idx2addr(idx, "rbx");
        write_asm!("mov qword [rbx+8], {number}"); // the data lives above the size tag
        return idx;
    }
        

Now let's codegen an add function.

    fn codegen_add(codegen: &Codegen, a: Expression, b: Expression) -> usize {
        // codegen the expressions
        let a = codegen.codegen_expression(a);
        let b = codegen.codegen_expression(b);

        // dereference the indices
        codegen.idx2addr(a, "rbx");
        codegen.idx2addr(b, "rcx");
        write_asm!("mov rbx, [rbx+8]");
        write_asm!("mov rcx, [rcx+8]");

        // add the numbers
        write_asm!("add rbx, rcx");

        // allocate the result in a fresh stack entry
        // (use rcx for the address so the sum in rbx survives)
        let idx = codegen.alloc(16);
        codegen.idx2addr(idx, "rcx");
        write_asm!("mov qword [rcx+8], rbx");
        return idx;
    }
        

Now we'll write the expression generator.

    fn codegen_expression(codegen: &Codegen, expression: Expression) -> usize {
        match expression {
            Number(num) => codegen.codegen_number(num),
            Add(a, b) => codegen.codegen_add(a, b),
        }
    }
        

And from here it's easy to imagine how the rest of the language was implemented.

Of course, our code-generator will be generating assembly that utilises our funky stack.

    void main() {
        1 + 2;
    }
        

Gets converted roughly into:

    _start:
        ; allocate 'a' on stack
        sub rsp, 16
        mov qword [rsp], 16
        mov qword [rsp+8], 1

        ; allocate 'b' on stack
        sub rsp, 16
        mov qword [rsp], 16
        mov qword [rsp+8], 2

        ; dereference indices
        mov rbx, rsp
        mov rcx, rsp
        add rcx, [rcx]

        mov rbx, [rbx+8]
        mov rcx, [rcx+8]

        ; add numbers
        add rbx, rcx

        ; allocate result on the stack
        sub rsp, 16
        mov qword [rsp], 16
        mov qword [rsp+8], rbx

        ; exit(1) via the 32-bit syscall ABI: eax=1 is sys_exit
        mov eax, 1
        mov ebx, 1
        int 0x80
        

This part (making the language) was the hardest part by far. Writing the hash-map is a cake walk in comparison.

a cake walk (in comparison)

Hash-maps are actually simple: they're just arrays with clever indexing. We take a key, hash it, modulo it by the length of the array, and that's our index. If there is a hash collision (two keys map to the same slot) then we resize the array and try again.

    struct Bucket {
        i32 key,
        i32 value,
    }

    struct HashMap = []Bucket;

    void hashmap_insert(&HashMap map, i32 key, i32 value) {
        // hash the key into an index
        u64 i = hashfunction(key) % map.size;

        if (map[i].key == key or map[i].is_empty()) {
            // if the key matches or the entry is empty,
            // then insert!
            map[i].key = key;
            map[i].value = value;
        } else {
            // if the key doesn't match and the entry isn't empty,
            // then we have a hash collision! let's resize the hashmap and try again.
            hashmap_resize(map, map.size * 2);
            hashmap_insert(map, key, value);
        }
    }

The hashmap_resize function is just going to create a new hashmap and reinsert all the items:

    void hashmap_resize(&HashMap map, u64 new_size) {
        HashMap new_map = new HashMap(new_size);

        for (i32 i = 0; i < map.size; i++) {
            if (!map[i].is_empty()) {
                hashmap_insert(&new_map, map[i].key, map[i].value);
            }
        }

        *map = new_map;
    }
        

There are many different techniques that can avoid hash collisions and excessive reallocation, for example by probing adjacent slots (open addressing) or by using more "spread-out" hash functions. Honorable mention: Fibonacci hashing.

conclusion

That's it! This is the end of the post. Obviously, this is completely useless and no one would want to use a stack like this because of the *awful* performance it gives. I mean, look how many instructions a simple stack allocation takes... and oh god, the time complexity! It was fun though!