How to Enable 64bit Mode on x86

Written by mvuksano | Published 2020/08/07


In the previous set of articles we worked our way through configuring a vCPU and getting it to run in 32bit mode with paging enabled. In this article we will take it a step further and enable 64bit mode.
Before we can run a CPU in 64bit mode we need to "reconfigure" our page tables. Don't worry - this exercise should be a walk in the park compared to the previous one (enabling paging and switching the CPU to 32bit mode).

64bit vs 32bit page tables

Before we dive into the implementation, let's look at how paging in 64bit mode differs from paging in 32bit mode[1].
In 32bit mode we used a three-level paging scheme. The tables were called:
1. Page Directory Table (PDT)
2. Page Directory (PD)
3. Page Table (PT)
We used one part of the virtual address (VA) in combination with CR3 to locate an entry in the PDT, called the PDTE. Then we used the PDTE in combination with another part of the VA to locate an entry in the PD (PDE). From there we used yet another part of the VA in combination with the PDE to locate an entry in the PT (PTE). Finally we combined the PTE with the remaining part of the VA to come up with the physical address (PA). In summary, we traversed the table structures in this order: PDT -> PD -> PT -> PA.
In 64bit mode x86 uses a 4-level paging scheme in which the tables have the following names:
1. PML4 (Page Map Level 4 Table)
2. PDP (Page Directory Pointer)
3. PD (Page Directory)
4. PT (Page Table)

In this scheme tables are traversed in the order PML4 -> PDP -> PD -> PT -> PA.
Also keep in mind that in long mode page tables contain 8 byte entries, compared to 4 byte entries in 32bit mode. Each table holds up to 2^9 = 512 entries, so each table can be up to 2^9 * 2^3 = 4096 (or 0x1000) bytes large.
The following image from Intel System Programming Guide best illustrates the process.
Here you can find an illustration that works through an example where 0xc000 is identity mapped.
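The traversal described above can be made concrete with a small sketch. It assumes 4 KiB pages and the standard 4-level layout, in which each table index is 9 bits wide and the page offset is 12 bits. The names va_parts, split_va and check_worked_example are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Indices into the four long-mode paging tables, plus the page offset. */
struct va_parts {
    unsigned pml4;   /* VA bits 47..39 */
    unsigned pdp;    /* VA bits 38..30 */
    unsigned pd;     /* VA bits 29..21 */
    unsigned pt;     /* VA bits 20..12 */
    unsigned offset; /* VA bits 11..0  */
};

/* Split a virtual address into the pieces used during the table walk. */
struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    p.pml4   = (unsigned)((va >> 39) & 0x1ff);
    p.pdp    = (unsigned)((va >> 30) & 0x1ff);
    p.pd     = (unsigned)((va >> 21) & 0x1ff);
    p.pt     = (unsigned)((va >> 12) & 0x1ff);
    p.offset = (unsigned)(va & 0xfff);
    return p;
}

/* Worked example from the text: the address 0xc000. */
void check_worked_example(void)
{
    struct va_parts p = split_va(0xc000);
    assert(p.pml4 == 0 && p.pdp == 0 && p.pd == 0); /* first entry of each upper table */
    assert(p.pt == 12);                             /* 0xc000 / 0x1000 = 12 */
    assert(p.offset == 0);
}
```

For 0xc000 every upper-level index is 0 and the PT index is 12, so the walk uses the first entry of PML4, PDP and PD and the thirteenth entry of PT.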

Create page tables

Firstly, let's change our page tables. Instead of uint32_t entries, which are 4 bytes in size, we will use uint64_t entries.
In this example we have 0x10000 (65536) bytes allocated for the VM. We will identity map this whole area.
The addresses used for the tables are 0x1000, 0x2000, 0x3000 and 0x4000 for PML4, PDP, PD and PT respectively. The function below writes the PML4 at mem itself, so for its entries to line up with these addresses mem has to point at guest physical address 0x1000. The first three tables each have only one entry, which points to the next table: PML4 has one entry pointing to PDP, PDP has one entry pointing to PD, and PD has one entry pointing to PT. The reason is that each table holds up to 2^9 = 512 entries, so a single PT can map 512 * 0x1000 = 0x200000 bytes (2 MiB). That is far more than we need: one PT hosts our 16 entries (0x10000 / 0x1000 = 0x10 = 16).
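void createPageTable(void *mem) {
	// mem is assumed to point at guest physical address 0x1000,
	// which is where the text places the PML4.
	// 0x3 sets the present (bit 0) and read/write (bit 1) flags.

	// PML4 at 0x1000: one entry pointing to PDP at 0x2000
	uint64_t pml4e = 0x2000 | 0x3;
	memcpy(mem, &pml4e, 8);

	// PDP at 0x2000: one entry pointing to PD at 0x3000
	uint64_t pdpte = 0x3000 | 0x3;
	memcpy(mem + 0x1000, &pdpte, 8);

	// PD at 0x3000: one entry pointing to PT at 0x4000
	uint64_t pde = 0x4000 | 0x3;
	memcpy(mem + 0x2000, &pde, 8);

	// PT at 0x4000: 16 entries, each identity mapping one 4 KiB page
	uint64_t pte_1 = 0x0000 | 0x3;
	memcpy(mem + 0x3000, &pte_1, 8);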

	uint64_t pte_2 = 0x1000 | 0x3;
	memcpy(mem + 0x3008, &pte_2, 8);

	uint64_t pte_3 = 0x2000 | 0x3;
	memcpy(mem + 0x3010, &pte_3, 8);

	uint64_t pte_4 = 0x3000 | 0x3;
	memcpy(mem + 0x3018, &pte_4, 8);

	uint64_t pte_5 = 0x4000 | 0x3;
	memcpy(mem + 0x3020, &pte_5, 8);

	uint64_t pte_6 = 0x5000 | 0x3;
	memcpy(mem + 0x3028, &pte_6, 8);

	uint64_t pte_7 = 0x6000 | 0x3;
	memcpy(mem + 0x3030, &pte_7, 8);

	uint64_t pte_8 = 0x7000 | 0x3;
	memcpy(mem + 0x3038, &pte_8, 8);
	
	uint64_t pte_9 = 0x8000 | 0x3;
	memcpy(mem + 0x3040, &pte_9, 8);

	uint64_t pte_10 = 0x9000 | 0x3;
	memcpy(mem + 0x3048, &pte_10, 8);

	uint64_t pte_11 = 0xa000 | 0x3;
	memcpy(mem + 0x3050, &pte_11, 8);

	uint64_t pte_12 = 0xb000 | 0x3;
	memcpy(mem + 0x3058, &pte_12, 8);

	uint64_t pte_13 = 0xc000 | 0x3;
	memcpy(mem + 0x3060, &pte_13, 8);

	uint64_t pte_14 = 0xd000 | 0x3;
	memcpy(mem + 0x3068, &pte_14, 8);

	uint64_t pte_15 = 0xe000 | 0x3;
	memcpy(mem + 0x3070, &pte_15, 8);

	uint64_t pte_16 = 0xf000 | 0x3;
	memcpy(mem + 0x3078, &pte_16, 8);
}
The above code is not the most compact way to set up the mapping. Normally you would populate the tables using loops, but because the goal of these examples is teaching, the code is left verbose on purpose.
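For comparison, here is a minimal sketch of what a loop-based version might look like. It is not the code used in this series: the function name create_page_tables_loop and its parameters are made up for this illustration, and it assumes the same layout as above (PML4, PDP, PD and PT in four consecutive 4 KiB pages, identity mapping starting at guest address 0). It takes a uint8_t pointer so that the pointer arithmetic does not rely on the GNU extension for void pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   0x1000ULL
#define PTE_PRESENT 0x1ULL /* bit 0: entry is valid */
#define PTE_WRITE   0x2ULL /* bit 1: page is writable */

/*
 * tables:     host pointer to the memory that holds the PML4
 * tables_gpa: guest physical address of that same memory (0x1000 in the text)
 * map_size:   bytes to identity map from guest address 0; must be a multiple
 *             of 4 KiB and at most 2 MiB, the reach of a single PT
 * Returns 0 on success and -1 if map_size is not acceptable.
 */
int create_page_tables_loop(uint8_t *tables, uint64_t tables_gpa, uint64_t map_size)
{
    const uint64_t flags = PTE_PRESENT | PTE_WRITE; /* the 0x3 used above */

    assert(tables != NULL);
    if (map_size > 512 * PAGE_SIZE || map_size % PAGE_SIZE != 0)
        return -1;

    /* PML4, PDP and PD: a single entry each, pointing at the next table. */
    for (int level = 0; level < 3; level++) {
        uint64_t entry = (tables_gpa + (uint64_t)(level + 1) * PAGE_SIZE) | flags;
        memcpy(tables + level * PAGE_SIZE, &entry, sizeof entry);
    }

    /* PT: one entry per 4 KiB page, each mapping an address to itself. */
    uint8_t *pt = tables + 3 * PAGE_SIZE;
    for (uint64_t i = 0; i < map_size / PAGE_SIZE; i++) {
        uint64_t pte = (i * PAGE_SIZE) | flags;
        memcpy(pt + i * sizeof pte, &pte, sizeof pte);
    }
    return 0;
}
```

Calling create_page_tables_loop(tables, 0x1000, 0x10000) writes the same 19 entries as the verbose version, provided tables points at the host memory backing guest address 0x1000.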

Enable long mode

To enable long mode we need to do three more things:
1. Set the PAE bit in CR4.
2. Set the LME and LMA bits in the EFER register.
3. Set the L bit in CS.
To change those registers we first read the special registers using the KVM_GET_SREGS ioctl, and write them back with KVM_SET_SREGS once they are modified.
Setting the PAE bit is straightforward. We read the current value and OR it with 0x20 (bit 5 set to 1).
sregs.cr4 = sregs.cr4 | 0x20;
Setting the long mode bits in EFER comes with a caveat. Most documentation will tell you that you need to set bit 8 (LME, Long Mode Enable) in the EFER register to enable long mode. That is true when working with bare metal (no VM). For a VM to enter long mode you need to set both bit 8 (LME) and bit 10 (LMA, Long Mode Active). Bits are counted from zero here, the same way as for CR4 above.
To understand why, take a look at the description of the EFER register in "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 4: Model-Specific Registers".
You can see that bit 10, when set, indicates that long mode (IA-32e) is active. On a physical CPU you set LME and the CPU sets LMA itself once paging is enabled, so you should check whether the CPU really activated it. When working with a VM, setting bit 8 without also setting bit 10 will not work.
Now we can use the same technique that we used with the CR4 register. This time we OR the register with the value 0x500 (bits 8 and 10 set).
sregs.efer = sregs.efer | 0x500;
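The magic numbers 0x20 and 0x500 are easier to follow with names attached. The sketch below counts bits from zero and uses a small stand-in struct so that it can be compiled without KVM headers. The struct ctrl_regs and the function set_long_mode_bits are hypothetical. In the real program the two fields would be the cr4 and efer members of the sregs structure read with KVM_GET_SREGS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CR4_PAE  (1ULL << 5)  /* 0x020: Physical Address Extension */
#define EFER_LME (1ULL << 8)  /* 0x100: Long Mode Enable */
#define EFER_LMA (1ULL << 10) /* 0x400: Long Mode Active */

/* Hypothetical stand-in for the two sregs fields touched in this step. */
struct ctrl_regs {
    uint64_t cr4;
    uint64_t efer;
};

/* Set the bits with OR so that all other bits keep their current value. */
void set_long_mode_bits(struct ctrl_regs *r)
{
    assert(r != NULL);
    r->cr4  |= CR4_PAE;             /* same as cr4 | 0x20 */
    r->efer |= EFER_LME | EFER_LMA; /* same as efer | 0x500 */
}
```

OR is the right operator here because it is idempotent: calling set_long_mode_bits twice leaves the bits set, whereas an XOR with the same masks would clear them again on the second call.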
Lastly we need to set the L bit in the CS (code segment) register. It is sometimes misunderstood how code segments are used in IA-32e mode. Specifically, it's often thought that they don't exist or are not used. The truth is that some fields are ignored (e.g. the base address and limit fields), some are treated as 0 in some calculations, and the remaining bits are used normally.
Code segment descriptors and selectors are needed in IA-32e mode to establish the processor's operating mode and execution privilege level.
What is relevant in our case is that setting the L bit of the CS register while IA-32e mode is active means that the processor starts using a default address size of 64 bits and a default operand size of 32 bits[R2]. If we do not set this bit, our processor will operate in compatibility mode rather than 64 bit mode.
sregs.cs.l = 0x1;
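How the L bit interacts with the other settings can be summarised as a small decision function. This is a sketch of the rules in the Intel manual as I understand them, not code from the VM itself. The enum, the function names and the parameters are made up for this illustration, and the D (default operand size) bit of CS is assumed to be 0 in our setup.

```c
#include <assert.h>

enum cpu_submode {
    LEGACY_MODE,          /* IA-32e mode is not active */
    COMPATIBILITY_MODE,   /* IA-32e active, CS.L = 0 */
    MODE_64BIT,           /* IA-32e active, CS.L = 1, CS.D = 0 */
    RESERVED_COMBINATION  /* IA-32e active, CS.L = 1, CS.D = 1 */
};

/* Classify the operating mode from EFER.LMA and the L and D bits of CS. */
enum cpu_submode classify_mode(int efer_lma, int cs_l, int cs_d)
{
    if (!efer_lma)
        return LEGACY_MODE;          /* L does not select 64 bit mode here */
    if (!cs_l)
        return COMPATIBILITY_MODE;   /* D picks 16 or 32 bit defaults */
    if (cs_d)
        return RESERVED_COMBINATION; /* L = 1 together with D = 1 is reserved */
    return MODE_64BIT;
}

/* The setup in this article: LMA set, cs.l = 1, D bit assumed clear. */
void check_article_setup(void)
{
    assert(classify_mode(1, 1, 0) == MODE_64BIT);
}
```

With LMA set but L left at 0 the function returns COMPATIBILITY_MODE, which is the case the paragraph above warns about.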
And that's it. Now our VM is able to run 64 bit programs.

Run program

Finally, we need to compile a program that we can run. In my previous article I talked about the differences between 16/32/64 bit programs and how to compile for each architecture. Let's use two toy programs in this example too.
The first program loads two numbers into registers rax and rbx, adds them, compares the sum with a third value and outputs Y if they are equal or N if they are not. With the values used below the sum is 0x300000000, which differs from 0x200000000, so it outputs N. From there it jumps to a different program (b.asm) that is located at memory address 0xc000. That program outputs E (for end) and halts the machine.
;a.asm
BITS 64
mov rax, 0x100000000
mov rbx, 0x200000000
add rax, rbx
mov rbx, 0x200000000
cmp rax, rbx
jz .equal

mov rax, 'N'
mov edx, 0x3f8
out dx, al
mov rax, 0xc000
jmp rax

.equal:
mov rax, 'Y'
mov edx, 0x3f8
out dx, al

mov rax, 0xc000
jmp rax

;b.asm
BITS 64

mov rax, 'E'
mov edx, 0x3f8
out dx, al
hlt
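The constants in a.asm do not fit in 32 bits, and one plausible reading (an interpretation, not something stated above) is that this makes the result depend on full 64 bit arithmetic. The host-side sketch below models only the arithmetic. It does not model how a CPU would decode the 64 bit instruction encodings in another mode. The function names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The add and compare from a.asm with 64 bit wide registers. */
char compare_64bit(void)
{
    uint64_t rax = 0x100000000ULL;
    uint64_t rbx = 0x200000000ULL;
    rax += rbx;                    /* 0x300000000 */
    rbx = 0x200000000ULL;
    return rax == rbx ? 'Y' : 'N';
}

/* The same steps if only the low 32 bits of every value were kept. */
char compare_low32(void)
{
    uint32_t eax = (uint32_t)0x100000000ULL; /* truncates to 0 */
    uint32_t ebx = (uint32_t)0x200000000ULL; /* truncates to 0 */
    eax += ebx;                              /* 0 */
    ebx = (uint32_t)0x200000000ULL;          /* 0 */
    return eax == ebx ? 'Y' : 'N';
}

/* 64 bit arithmetic gives N, truncated arithmetic would give Y. */
void check_model(void)
{
    assert(compare_64bit() == 'N');
    assert(compare_low32() == 'Y');
}
```

Under this model an N from the first program is consistent with the upper 32 bits of the registers taking part in the addition and the comparison.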
Each of the programs can be compiled using nasm:
nasm -O0 -o a.bin a.asm
nasm -O0 -o b.bin b.asm
The -O0 flag tells nasm not to use any optimizations. This is to ensure that we run 64 bit instructions rather than potentially having some instructions optimized into the same encodings as in 32 bit mode. During normal operation you would let the assembler optimize the code, but in this case we want to be sure that the code we run is 64 bit and that the vCPU is running in true long mode (IA-32e / 64 bit mode).
Finally, if we run our program we will be presented with the following output:

Conclusion

This brings us to the end of yet another challenge. Our VM can now run 64 bit programs. You can already use it to run a lot of useful programs. To make it even more useful, in the next article we will look into exception handling. Stay tuned!

Notes:

1. Technically there is one more variation of paging in 32bit mode in addition to the one we used in the previous article. You can check out the chapter on paging in the Intel x86 manual [R1].

References:

[R1] Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide
[R2] Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide - 5.2.1 Code-Segment Descriptor in 64-bit Mode

Written by mvuksano | PSS - Pragmatic problem solver @ Facebook
Published by HackerNoon on 2020/08/07