deception
In the previous episode of alloca shenanigans, I said this:
The stack is like a stack of plates. You can't change things in the middle or the whole thing falls down
I lied! Well... sort of. You see... the stack is just a bunch of bits, which means we don't have to treat the stack like a stack at all! In this second episode of alloca shenanigans we'll be implementing a custom data-structure in stack memory, which will allow us to resize things in the middle of the stack. I'll end it in a grand crescendo where we implement a resizing hashmap in stack memory with a custom programming language!
You can see the project on GitHub.
the structure
We will be using a list-like data-structure for our custom stack: each element contains its data followed by the size of that data. For example, a list with two elements will look like this:
<DATA> <DATA> <DATA> < 3 > <DATA> < 1 >
↑
rsp
The stack pointer will point to the last element in the list.
implementation
We'll start off implementing our stack in assembly.
We need to write some primitives for stack manipulation: alloc, pop and resize.
Let's start with alloc because it's the simplest.
Here is the assembly code:
alloc:
; allocate size bytes
; set the last 8 bytes to size
sub rsp, rbx
mov qword [rsp], rbx
; jump back to the caller
jmp rax
Because of our unique constraints, we generally aren't allowed to use
call, push, pop or any assembly instruction that modifies the stack.
Instead we can use jmp instructions and store the return address in a register.
The return address is passed in rax and the number of bytes to allocate in rbx. Note the mov qword: the last 8 bytes of the allocation hold the size. The amount in rbx therefore has to include an extra 8 bytes for the size tag.
And here is the implementation for pop, first as pseudo-code, then in assembly.
fn pop() {
rsp -= *rsp;
}
pop:
add rsp, [rsp]
jmp rax
Now, implementing resize is going to be a little more complicated. For the sake of brevity I'll
just show the pseudo-code, but you can imagine how this would translate to assembly.
fn resize(ptr: *const void, amount: i64) {
// the stack moves downwards
// we start at i=rsp and move up to ptr
for (let i = rsp; i < ptr; i += 8) {
*(i - amount) = *i;
}
// update the size and rsp
let size = *ptr;
*(ptr - amount) = size + amount;
rsp -= amount;
}
We'll start at rsp and walk up towards ptr, moving all the data we meet along the way down by the given amount.
<DATA> <DATA> <SIZE=2> <DATA> <SIZE=1>
↑ ↑
ptr rsp
Then we set the new size and update rsp.
<DATA> <DATA> <DATA> <SIZE=3> <DATA> <SIZE=1>
↑ ↑
(ptr - amount) rsp
Nice! Dynamic stack memory! But there is still one more issue, and that is pointers. If we resize memory before a pointer, then everything gets pushed around and our pointer no longer points to the right object. Oh No!
<DATA> <SIZE=1> <DATA> <SIZE=1>
↑
ptr (2nd item)
<DATA> <NEW DATA> <NEW DATA> <NEW DATA> <SIZE=4> <DATA> <SIZE=1>
↑
ptr (invalidated)
The solution? Instead of storing the address of our data, we can store the index. Then we can convert that index into an address when we need to use it.
We can convert an index into an address by calculating how many steps backwards our data is from the end of the stack with length - index - 1. For example, if our stack has 3 elements and we want to find the data at index 1, we would need to move 3-1-1 = 1 step backwards from the end of the stack. Here is the pseudo-code for converting an index into an address.
fn idx2addr(idx: u64) -> u64 {
let ptr = rsp;
for (let i = 0; i < length-idx-1; i++) {
ptr += *ptr; // the stack grows downward, so older elements sit at higher addresses
}
return ptr;
}
Note that this means we have to keep track of the length of our list in a register somewhere. Since we
aren't using rbp for anything else, I've designated rbp as our length register.
We'll update alloc and pop to increment/decrement rbp.
For deeply nested data-structures like &arr[0][1][2], we'll have to store the reference as a list of indices. This means that references to deeply nested data-structures can get quite long, but that's a sacrifice I'm willing to make.
You can download an ASM code example here, but you may need an exotic debugger to be able to visualise how stack memory changes.
a language
We have an issue. Writing assembly is painful, but no language supports our funky stack; they all expect the stack to behave in a certain way. This is where the custom programming language comes in. It doesn't need to be fancy, just a basic wrapper around assembly. Like C but for stack shenanigans.
I'll be glossing over the whole "making a language" part in this blog post. I highly suggest you try following Make a Lisp and LLVM Kaleidoscope if you're feeling up for a challenge. They'll teach you a lot about writing your own languages. You can see my own implementation of "Make a Lisp" here.
This is going to be quite difficult...
For simplicity, the language will compile to NASM assembly. First we have a tokenizer. It'll convert a stream of text like:
u32 inc(u32 a) {
return a+1;
}
Into a list of tokens, like:
[
Symbol("u32"), Symbol("inc"),
Operand("("), Symbol("u32"), Symbol("a"), Operand(")"),
Operand("{"),
Return, Symbol("a"), Operand("+"), Number(1), Operand(";"),
Operand("}"),
]
I decided to give the language C-like syntax because C is familiar and easy to parse. Now that we have a stream of tokens, the next step will be a little easier.
We need to convert our stream of tokens into an abstract syntax tree (AST). For example, the abstract syntax
tree of 2+3*4 would look something like:
Sym(+)
/ \
/ \
/ \
Num(2) Sym(*)
/ \
/ \
/ \
Num(3) Num(4)
There is a section in LLVM Kaleidoscope about parsing "binary expressions", and the first part of this YouTube video shows how to implement a binary expression parser. If you want to try this yourself and are a beginner, I highly suggest using a lisp-like syntax, since it is much easier to parse.
Now we have an AST that can represent function definitions, variable assignments,
expressions, and control flow operations (if, while, for). We need to write a code-generator that generates
our NASM assembly. It'll take in our AST and output NASM. Then we can call the nasm command
line tool directly to
compile it into a binary.
The code-generator was the hardest part to do. It is so hard to debug thousand-line assembly files when no debugger works because your stack is too screwed up, which is why I wrote my own (wink wink, nudge nudge).
Let's rewrite some of our assembly functions in our code-generator. For our alloc function, we'll make sure to keep track of the stack size and return the index of the allocated data in the stack.
struct Codegen {
stack_size: usize,
}
fn alloc(codegen: &Codegen, size: u64) -> usize {
let idx = codegen.stack_size;
// allocate size bytes
// set the last 8 bytes to size
write_asm!("sub rsp, {size}");
write_asm!("mov qword [rsp], {size}");
codegen.stack_size += 1;
return idx;
}
And the pop function follows in the same vein.
fn pop(codegen: &Codegen) {
write_asm!("add rsp, [rsp]");
codegen.stack_size -= 1;
}
Then we can implement our idx2addr function. It will take in an index and return the address in a register.
fn idx2addr(codegen: &Codegen, idx: usize, reg: &str) {
write_asm!("mov {reg}, rsp");
for (let i = 0; i < codegen.stack_size-idx-1; i++) {
write_asm!("add {reg}, [{reg}]");
}
}
Now we can start with code-generating a simple number. We'll need to allocate some space on the stack and then put that number in it.
fn codegen_number(codegen: &Codegen, number: Number) -> usize {
let idx = codegen.alloc(16);
codegen.idx2addr(idx, "rbx");
write_asm!("mov qword [rbx+8], {number}"); // the data lives after the size tag
return idx;
}
Now let's codegen an add function.
fn codegen_add(codegen: &Codegen, a: Expression, b: Expression) -> usize {
// codegen the expressions
let a = codegen.codegen_expression(a);
let b = codegen.codegen_expression(b);
// dereference the indices
codegen.idx2addr(a, "rbx");
codegen.idx2addr(b, "rcx");
write_asm!("mov rbx, [rbx+8]");
write_asm!("mov rcx, [rcx+8]");
// add the numbers
write_asm!("add rbx, rcx");
// allocate the result
let idx = codegen.alloc(16);
codegen.idx2addr(idx, "rdx");
write_asm!("mov qword [rdx+8], rbx");
return idx;
}
Now we'll write the expression generator.
fn codegen_expression(codegen: &Codegen, expression: Expression) -> usize {
match expression {
Number(num) => codegen.codegen_number(num),
Add(a, b) => codegen.codegen_add(a, b),
}
}
And from here it's easy to imagine how the rest of the language was implemented.
Of course, our code-generator will be generating assembly that utilises our funky stack.
void main() {
1 + 2;
}
Gets converted roughly into:
_start:
; allocate 'a' on stack
sub rsp, 16
mov qword [rsp], 16
mov qword [rsp+8], 1
; allocate 'b' on stack
sub rsp, 16
mov qword [rsp], 16
mov qword [rsp+8], 2
; dereference indices
mov rbx, rsp
mov rcx, rsp
add rcx, [rcx]
mov rbx, [rbx+8]
mov rcx, [rcx+8]
; add numbers
add rbx, rcx
; allocate result on the stack
sub rsp, 16
mov qword [rsp], 16
mov qword [rsp+8], rbx
; exit(1)
mov eax, 1
mov ebx, 1
int 0x80
This part (making the language) was the hardest part by far. Writing the hash-map is a cake walk in comparison.
a cake walk (in comparison)
Hash-maps are actually simple: they're just arrays with clever indexing. We take a key, hash it, modulo it by the length of the array, and that's our index. If there's a hash collision, we resize the array; if there isn't, we just insert. Here is the pseudo-code:
struct Bucket {
i32 key,
i32 value,
}
struct HashMap = []Bucket;
void hashmap_insert(&HashMap map, i32 key, i32 value) {
// hash the key into an index
u64 i = hashfunction(key) % map.size;
if (map[i].key == key or map[i].is_empty()) {
// if the key matches or the entry is empty.
// then insert!
map[i].key = key;
map[i].value = value;
} else {
// if the key doesn't match and the entry isn't empty.
// then we have a hash collision! let's resize the hashmap and try again.
hashmap_resize(map, map.size * 2);
hashmap_insert(map, key, value);
}
}
The hashmap_resize function is just going to create a new hashmap and reinsert all the items:
void hashmap_resize(&HashMap map, i32 new_size) {
HashMap new_map = new HashMap(new_size);
for (i32 i = 0; i < map.size; i++) {
if (!map[i].is_empty()) {
hashmap_insert(&new_map, map[i].key, map[i].value);
}
}
*map = new_map;
}
There are many techniques that avoid these extra reallocations, such as probing adjacent slots or using more "spread-out" hash functions. Honorable mention: Fibonacci Hashing.
Conclusion
That's it! This is the end of the post. Obviously, this is completely useless and no one would want to use a stack like this because of the *awful* performance. I mean, look how many instructions a simple stack allocation takes... and oh god, the time complexity! It was fun though!