Saturday, 20 December 2014

Compiler 2 Part 9: Assembly Primer

Assembly is a huge topic, and entire classes are dedicated to it. It would be impossible to teach assembly in a single post, but I hope to give enough of an overview that the C code output by the compiler, and the C runtime, make some sense.

This article will be concerned with the x86 instruction set and architecture. There are several different syntaxes for assembly (Intel and AT&T, for example), but I’ll try to keep the code as syntax agnostic as possible.

Sizes


Everyone ought to be familiar with a byte.

A byte is made up of a number of bits which represent the smallest unit of information. The vast majority of modern desktop PCs have 8 bit bytes. A single bit has two states: 1 or 0. A single 8 bit byte has 256 different combinations of bits.

Since a bit has two states, the number of bits grouped together determines how many different combinations can be achieved. For an 8 bit byte, we calculate it by raising two to the power of eight: 2^8 = 256. Therefore, a single byte has the range -128 to 127 (signed) or 0 to 255 (unsigned). It is represented by a ‘b’.

A word is two bytes. In C this is usually the short type. It has 2^16 combinations, or 256 * 256 = 65,536. It has the range -32,768 to 32,767 (signed) or 0 to 65,535 (unsigned). It is represented by a ‘w’.

A long is two words or four bytes. In C this is usually the int type; the size of C’s long type varies by platform and may be four or eight bytes. It has 2^32 combinations and is represented by an ‘l’ (lower case L).

A quad is two longs, four words or eight bytes. In C it is usually the long long type. It has 2^64 combinations and is represented by a ‘q’.

Registers


There are eight basic, general purpose registers and they are as follows:

| Register | Name | Example Usage |
|---|---|---|
| ax | accumulator | arithmetic, return value |
| bx | base | data pointer, temporary storage of intermediate value |
| cx | counter | loops |
| dx | data | temp storage |
| sp | stack pointer | points to the top of the current stack frame |
| bp | base pointer | points to the bottom of the current stack frame |
| si | source index | source array index |
| di | destination index | destination array index |

These registers can be accessed in various sizes (byte, word, long and quad). A quad sized (64 bit) register has an ‘r’ prefix, a long sized (32 bit) register has an ‘e’ prefix and both the word and byte sized registers have no prefix. The target architecture restricts which registers are available. As an example, 64 bit registers are not available on 32 bit machines.

The four registers ending in x can also have the high and low bytes of their word accessed for storing byte sized information. In these cases, the x is replaced by an h for the high byte and an l for the low byte.

Using the ax register as an example:


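| Register | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | Desc. |
|---|---|---|---|---|---|---|---|---|---|
| al | - | - | - | - | - | - | - | Y | byte (low) |
| ah | - | - | - | - | - | - | Y | - | byte (high) |
| ax | - | - | - | - | - | - | Y | Y | word |
| eax | - | - | - | - | Y | Y | Y | Y | long |
| rax | Y | Y | Y | Y | Y | Y | Y | Y | quad |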

The numbers 1 through 8 represent the bytes accessed, with 1 being the least significant byte.

64 Bit General Purpose Registers


The eight general purpose registers discussed above constitute the first eight registers, 0 through 7. In 64 bit mode, eight additional registers, R8 through R15, are also available.

Accessing the 32-bit version of these registers requires the D (double-word) suffix (r8d); 16-bit access uses a W suffix (r8w) and byte access uses a B suffix (r8b).

As an example of usage of these registers, the version of GCC installed on my machine will use the rcx, rdx, r8 and r9 registers as the 1st, 2nd, 3rd and 4th arguments to functions, respectively, when compiling for 64 bit. This matches the Microsoft x64 calling convention used on Windows; the System V convention used on Linux and macOS passes the first arguments in rdi, rsi, rdx, rcx, r8 and r9 instead.

Instructions


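There are a lot of instructions but here are the ones needed to do some basic arithmetic. They are shown in a simplified two-operand form; the real x86 mul and div instructions take a single operand and work implicitly on the ax and dx registers.

| Instruction | Action |
|---|---|
| mov a, b | Copies the data stored in a to b |
| add a, b | Add a to b and store it in b |
| sub a, b | Subtract a from b and store it in b |
| mul a, b | Multiply b by a and store it in b |
| div a, b | Divide b by a and store it in b |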

Depending on the syntax, the operands might be reversed (mov might copy a into b in one syntax but b into a in another). The way the registers are addressed might also differ between assembly syntaxes, but rest assured that if you understand one you shouldn’t have much trouble figuring out another.

An example of 2+3 in pseudo-code:

mov 3, eax (set eax to 3)
mov 2, edx (set edx to 2)
add edx, eax (add 2 to 3 and store it in eax)


Each instruction might have a specific size it operates on. The mov instruction, for example, has b, w, l or q appended to its name (movb, movw, movl and movq), as do the arithmetic operations.

Fun Fact: Names like movl are mnemonics. Each one stands for a numeric operation code, or opcode, and a single mnemonic can map to several opcodes depending on its operands. The register-to-register form of movl, for example, is encoded with the opcode byte 0x89.

Segments


There are six segment registers:



| Register | Description |
|---|---|
| ss | stack segment, holds the address of the stack |
| cs | code segment |
| ds | data segment |
| es | extra segment |
| fs | extra segment (e+1 = f) |
| gs | extra segment (e+2 = g) |

You don’t really have to worry about these too much. The stack segment is the only one we’ll be dealing with, and indirectly at that.

Stack


It took me some time to really wrap my head around how the stack works.

The stack segment holds a pointer to the first address in the stack. It doesn’t matter where this memory came from. It works like any other memory.

The stack space is divided into regions called frames. A frame is used to hold all the data needed for a function. This includes, but isn’t limited to, local variables, the return address of the function and a pointer to the previous frame.

The sp register, the stack pointer, holds the address of the top-most location on the stack. This is the top of the current, or active, frame. The bp register, the base pointer, holds the address of the bottom of the frame. The active frame is the address space between the sp and bp registers.

Stack Frame


In a stack data structure you push (add) and pop (remove) elements to and from the stack. I would describe manipulating the stack in assembly as more fluid or elastic.

A push or pop instruction will push or pop a single element, but that element might not necessarily be the same size as another element. Depending on other instructions, you may need to align the stack pointer to a 16 byte boundary.

You can also request stack memory without a push or pop instruction at all. You simply take what you need and mark off the area with the bp and sp registers.

Caution


Assembly will let you do pretty much anything. It will hand you the gun, load a bullet into the chamber and hold your hand while you shoot yourself in the foot.

When adjusting the stack or base pointers, for example, you can set them to whatever memory addresses you want. Nothing will stop you from trying. That’s not to say it won’t have disastrous effects but it’ll let you do it.

Assembly has no notion of types. Want to add a character to a memory address? Ok! Want to overwrite part of a 64 bit value with a 16 bit value? Ok! It doesn’t care. It will obediently do what you tell it to. Consider this your forewarning.

Use extra, extra caution when programming in assembly.

Purpose


Why am I telling you all this?

Well, you might want to compile directly to assembly in your own project and this serves as a short primer on the subject. More importantly, as already stated, it is going to serve as a basis for the C runtime that has been provided.

As it turns out, assembly is useful for compiling languages. For one, it deals with data in a type agnostic way. As previously stated, assembly deals with values and memory addresses but it has no notion of type. It’s all just bits and bytes.

Additionally, the way registers work and how the stack is accessed are also of use. I won’t get into any details right now but passing arguments and handling variables is much easier than trying to translate them into a proper C equivalent function prototype.

Consider this: A language like Python is dynamically typed. C is statically typed. How do you translate a Python function prototype into a C prototype without knowing in advance the types of the values that will be passed at runtime?

One answer is to use the void pointer type but I think a better answer is to not use parameters at all.

I can’t say I’ve really done assembly much justice. There is heaps more to understand before I think you can really “get” assembly. If you’re interested in the subject, I highly recommend doing some searches.