Reverse Engineering a Gameboy Advance Game: Saving the custom level — Part 8

Bruno Macabeus
15 min readFeb 10, 2021

This post is part of a series entitled Reverse Engineering a Gameboy Advance Game. Read the introduction here. Read the previous post here.

Follow me on Twitter to more computer fun 🐦

After our glorious trajectory doing reverse engineering on the game Klonoa, we finally achieved our aim: we created a web app to customize its levels, klo-gba.js!
From now on it’s usable because we can create new challenges in the game. But there is a limitation on our current implementation: when it saves a custom level, the next one won’t work anymore. For instance, if we update the first level, the game will crash when trying to access the second level.

Even though I was aware about this annoying bug, I launched some alpha versions of klo-gba.js, because I wanted to get users’ feedback, and seeing people using my product is good to motivate me to keep developing.
I already suspected what was causing this bug, and I was aware that it would be hard to fix (and very fun, by the way). Only on the fifth release this bug was fixed.

Okay, but what was causing this bug?
Well… Until then I was applying a very simple approach to save the customized levels: just replacing the original level’s compressed data with the customized one. So the game will be able to load our level without needing to do any additional changes.
But we have the following problem: the levels are stored one after another in memory, that is, right after a level’s last byte is the beginning of the next level’s data. When we edit a level, its data could be larger than the original, so the last bytes will overwrite the beginning of the next level!

Although we change the level data’s size, the GBA is smart enough to determine how much data needs to be decompressed. So we don’t need to specify the new size. It can find that on its own. Therefore, changing the level data’s size isn’t, on its own, a problem. The real problem is that it’s overwriting the data of the next level.

Another curious example is that if we overwrite the first level and then the second level, the first level won’t work anymore (because we overwrote its end) and the third level won’t work either (because we overwrote its beginning). Only the second level will run!

I started with this hypothesis, and after that I looked for evidence that it was really the root of the problem. For that I just needed to compare the size of the original and customized level data.

Something that caught my attention is that, even without changing anything in the level, the compressed data will still be larger than the original!
I searched for some configuration in the old C code to see if I could generate a smaller buffer that would fit on the original slot, but it didn’t work… The new data is always larger than the original.

So I thought of another approach to solve the problem, which was really much more complex, but it was the only idea that came to mind: what if we patched the original code that loads the levels to redirect the level starting address to another memory region?
“Patch”, in the reverse engineering context, is when we modify the original assembly of a binary. Let’s say that it’s a glorified monkey patch

Okay. So where can we store the custom levels? Easy! Despite the ROM’s size being very small, at its end there is a lot of empty space, so we’ll use it now to store the custom levels as well as the instructions we’ll code now!

Wow, it sounds so cool! We’ll need to do a lot of reverse engineering stuff! For that, I’ll need to explain a little more about the processor we are working on, ARM7TDMI.

Let’s talk about ARM7TDMI…

This hardware implements the ARMv4 instructions, which, in turn, support two instruction sets: Thumb and ARM. The majority of GBA games (including Klonoa) are written mainly in Thumb, because these instructions are faster to run and require less memory; two important traits when working on limited hardware like the Gameboy Advance.
Besides Thumb there are ARM instructions, which can run more complex operations, such as storing bigger numbers and doing more specialized conditionals.
In order to switch between Thumb and ARM it is necessary to use the instruction bx (Branch and Exchange). Important: it takes as a parameter one register which says what the new program counter (PC) address will be.
I had a lot of challenges discovering how to do this switch, because the official ARM doc is complex and each person in the community said something different; one of the most useful documentations to help me was the No$GBA technical doc and this ARM doc.

In the fourth chapter, Where is the Tilemap in the ROM? , we found the instruction that loads the game level. I’ll speak a little more about it.
Here is the part of the code that loads a level (we’ll call it the level loader from now on):

08043B0A mov r4,r0     ; move to R4 the value from R0
08043B0C add r0,r5,4 ; store to R0 the value of R5 + 4
08043B0E mov r1,r4 ; move to R1 the value from R4
08043B10 bl 0805143Ch ; create a branch and move PC to 0805143C

The function called at 08043B10 is the function that decompresses the level, using the swi instruction we spoke before. Remember that it uses the address stored in R0 as input.
So in order to redirect the tilemap data we should change the R0 value to the address of the custom tilemap!

Since we need more space to store our customized level loader, we need to move the program counter to another memory region. Besides that, there is a nice feature in ARM which could help us to have more readable code: conditional execution, and I’ll explain now what it is.

In ARMv7 assembly there are condition flags such as N (negative) and Z (zero). Some instructions update these flags based on their results. One example of these instructions is cmp (compare). The processor uses these flags to determine if it should run conditional branch instructions, such as beq (Branch if Equal).
But there are moments when we could use something simpler than branching, for example, when we want to conditionally execute just a single instruction. In this case, we can apply conditional execution.
To do that we just need to add a suffix to the instruction, such as eq (“equal”) or ne (“not equal”). Let’s see a plain instance on ARM:

cmp   r0,5h  ; compare if R0 is equal to 5
; this updates the state of the condition flags
moveq r0,10h ; if they are equal, move to R0 the value 10
; this reads the state of the condition flags

If you want, you could read more about that here. It’s possible to do conditional execution using Thumb, but it requires more verbose code.
Originally I wrote the code using Thumb, and afterward rewrote it using ARM, and then I realized that it ended up being more readable. In order to simplify the post, I won’t talk about the original Thumb code; I’ll start directly with the ARM version.

Okay. To run our algorithm we need to load the tilemap address into a register, in order to determine whether we should update it or not. The address is very big, for instance, 0x081B27FC. Although it looks very easy to move this number to a register… nope, it isn’t trivial at all.

In the above assembly code, we wrote a few constants directly in the instruction: 5 and 10. It’s called immediate value, that is, constants that are written directly in the instruction payload. The first instruction moves the constant 5 to the register, and the second one moves 10. Easy, right?
But there are some limitations! There is a value limit to the immediate value, and this limit is different depending on the instruction and whether it is ARM or Thumb. For example, the instruction mov in Thumb can receive an immediate value with 8 bits (up to 0xFF), while in ARM it’s 16 bits (up to 0xFFFF). For instance, the following code just doesn’t exist in Thumb, we can’t write it and an assembler will raise an error. But, in ARM it’s valid:

mov r0,100h ; move the immediate value 0x100 to r0

But we need to store memory addresses in the registers, that is, 32-bit values. Even using ARM instructions we won’t have enough space! So how do we solve that? Initially, I wrote some very ugly code using bit-shifting (store the upper 16 bits, left-shift by 16 bits, and then add the lower 16 bits)… okay, it works, but there must be a better way to solve it, right?
Yeah, there is! Searching more, I learned that we can expand these limits using a literal pool. This approach is about writing our constants to a region of memory and then referencing it on the instruction. For example, if we want to send a value bigger than 16 bits to a register we should use the operator ldr with the parameter R15 (PC) plus an offset. This notation is:

ldr r0,[r15,#-3] ; load to R0 the value at [address in PC minus 3]

If we write this in No$GBA it’ll show the formatted value in the instructions list, which helps us to read the code. Look at the following screenshot:

With this the ldr loads about 32 bits. Again, remember that ARM is little endian, so the byte order is a little confusing.
The 03 00 1F E5 (E51F0003 in the table) is the instruction code of ldr r0,[r15,#-3]; keep this in mind, because we’ll chat more about it soon.

As usual, we have some limitations. In Thumb the offset is an 8-bit value, while in ARM it is 12-bit. Therefore, the constant we want to load should be at an address within 2⁸ (256) bits or 2¹² (4096) bits of the address of the ldr instruction.

For some unknown reason, No$GBA allows us to directly write commands like ldr r0,=1000h when editing an instruction, but it isn’t necessarily valid! It doesn’t correctly write a literal pool, but it “just works”. It confused me a lot while I was studying…

Normally what creates and manages the literal pool is an assembler, but since we don’t have one, we need to manage it manually.

Patching the level loader

The above brief context about ARM is enough to understand the implementation of the level loader patch that we’ll do now.

As a first step we need to move the PC (program counter) from the original level loader to another memory region where we’ll store our custom level loader.
I could only find an empty space veeery far away, around the address 08367610. Since I wanted to write using ARM (and the original level loader was written using Thumb) we need to use the bx instruction in order to do the swap, but it receives just one register, and it isn’t possible to store the value 08367610 in one register using just one line (as I explained above), and there isn’t enough space to do this patch in the level loader.
The solution that I found to move the PC far away and also to switch to ARM was the following: the instruction bl (Branch with Link) moves the PC, and it’s one of few instructions in Thumb that can take a 4-byte parameter, that is, the argument of the address the PC will move to can be the immediate value 08367610.
Did you notice that the full instruction name is “Branch with Link”? So, besides updating the PC to a new address, it also stores the previous PC address to R14. Then we can return to that previous address, including returning to Thumb, using the instruction bx r14.

Note that we overwrote two instructions in the original level loader to store the bl instruction, because it takes 4 bytes, while the add and mov take 2 bytes each. Eventually we’ll need to run these two overwritten instructions.
And note that, at this moment, we have two registers available: R0 and R1, because they receive values that we can calculate later.

The routine that does the switch from Thumb to ARM is the following:

08367610 mov r0,r15 ; moves into R0 the PC value (R15 is the PC)
08367612 add r0,3Ch ; adds 3C into R0
08367614 bx r0 ; moves to PC the value from R0 and switches to ARM

The above code does the switch to ARM and moves the PC 3C bytes forward.

Okay, now we need to write the function that runs the two original instructions from the original level loader and, more importantly, changes the R0 value to the address of the customized level! Which algorithm will we use to do this change?
This will be a very simple one… but since we are writing in assembly and without an assembler, anything becomes hugely complex…

We have a table of constants, which is our literal pool. It stores the address of the original level and the respective address of the customized level. The structure is very simple: line N is the original address, line N + 1 is the respective customized address.

We check to see if R0 matches a value in this table, checking the odd rows (first, third, and so on). If R0 matches a row, we replace its value with the next row in the table (second, fourth, and so on).

After many hours of suffering in writing the assembly code, here’s the result:

; original level loader code, sets r0 to r4 + 4
08367650 add r0,r4,4h
; each of these blocks does the following:
; - load into R4 a row from the table
; - compares R4 with R0
; - if they are equal, load the next row from the table into R0
08367654 ldr r4,[r15,#-3Ch]
08367658 cmp r0,r4
0836765C ldreq r0,[r15,#-40h]
08367660 ldr r4,[r15,#-40h]
08367664 cmp r0,r4
08367668 ldreq r0,[r15,#-44h]
0836767C ldr r4,[r15,#-44h]
08367680 cmp r0,r4
08367684 ldreq r0,[r15,#-48h]
; original level loader code
08367688 mov r4,r1
; switch to Thumb and return to the level loader
0836768C bx r14

Fantastic! This assembly code works very well! It solves our problem!
One tip for when you are going to write your own assembly code using No$GBA: since it isn’t exactly a “text editor” and moving an instruction to another line isn’t convenient, you should always leave some lines empty in case you eventually need them.

Okay. We manually wrote the code in No$GBA. But how can we do it automatically in klo-gba.js when we save a custom level?

In order to simplify, I decided that each level should have a specific address to save its custom tilemap, even though we could calculate it dynamically.
So, in the json-like files that store the data for each level, we now have the property rom.customTilemap.

Okay. Let’s write the function setPatchCustomVisionLoader to automatically write the patch which we already know works.

One of the first things that should be done is to check if a custom tilemap was written, in order to avoid unnecessary redirects. To do this, I wrote a very simple function:

const visionHasCustomTilemap = (romBuffer, visionInfo) =>
romBuffer[visionInfo.rom.customTilemap[0]] !== 0x00
const setPatchCustomVisionLoader = (romBuffer) => {
const visionsWithCustomTilemap = allVisions.map(visionInfo =>
visionHasCustomTilemap(romBuffer, visionInfo))
...
}

Besides that, it will help us to have an array containing the original tilemap address and its respective custom address, to be used to fill the table mentioned before.

const setPatchCustomVisionLoader = (romBuffer) => {
...
const addresses = allVisions.map(visionInfo => ({
custom: mapAddressToRomOffset(visionInfo.rom.customTilemap[0]),
original: mapAddressToRomOffset(visionInfo.rom.tilemap[0]),
}))
...
}

Okay. Now we’ll begin the most amazing part: let’s write assembly code into a ROM using JS!
Since we’ll write assembly directly, without using an assembler, we can’t use mnemonics, so we need to use the instruction’s numeric code. As you probably already know, each assembly instruction is just a sequence of numbers. For instance, mov r0,r15 is translated by the assembler to 0x4678. In this case, because ARMv7 is little-endian, the most significant bytes are written to the left, so we should write it as [0x78, 0x46].
There are a lot of tables that explain how to map an instruction to its number, but it would be too boring to do it manually. An easier way to get the numeric sequence is to write the assembly instructions in No$GBA. As we write the instruction, No$GBA displays the corresponding numeric instruction code.

Screenshot of No$GBA’s code window. Look at the second column values.

Firstly, we should patch the original level loader and write the code to switch to ARM. Remember that some instructions require 4 bytes, while others require 2 bytes.

const setPatchCustomVisionLoader = (romBuffer) => {
...
// set bl to go to our patch
romBuffer.set([0x23, 0xF3, 0x80, 0xFD], 0x43B0C) // bl 08367610h
// switch to arm mode and run our code
romBuffer.set([0x78, 0x46], 0x367610) // mov r0,r15
romBuffer.set([0x3C, 0x30], 0x367612) // add r0, 3Ch
romBuffer.set([0x00, 0x47], 0x367614) // bx r0
...
}

Did you notice something strange in the above code? No!? Well, look at the second argument on the set call! It says where we’ll write in the ROM data, and we are writing to addresses like 0x43B0C, but as we saw in No$GBA, the ROM data is located at addresses from 0x08000000!
What is happening is: the Gameboy Advance maps the ROM addresses to the addresses starting with 0x08 to be run by the processor. But since we are writing directly in the game ROM, we don’t have this mapping. So the addresses we are writing do not begin with 0x08. When it is running on the GBA it will be at 0x08043B0C, but here it is just 0x43B0C.

Okay. Now we should write the table which maps the original addresses to the custom addresses. I wrote some helper functions to make writing constant values easier.

import { range } from 'ramda'const splitHexValueIntoBytesArray = (hexValue, numberOfBytes) =>
range(0, numberOfBytes).map((i) => {
const shift = i * 8
const byte = (hexValue >> shift) & 0xFF
return byte
})
const setConstant = (romBuffer, address, value24hexLength) => {
const bytes = splitHexValueIntoBytesArray(value24hexLength, 4)
romBuffer.set(bytes, address)
}
const setPatchCustomVisionLoader = (romBuffer) => {
...
// write original and custom address of each vision
addresses.forEach(({ custom, original }, index) => {
const offset = 0x367620 + (index * 8)
setConstant(romBuffer, offset, original)
setConstant(romBuffer, offset + 4, custom)
})
...
}

And now we should write the instructions to change the R0 value.

const setPatchCustomVisionLoader = (romBuffer) => {
...
romBuffer.set(
[0x04, 0x00, 0x85, 0xE2],
0x367650
) // add r0, r5, 4h
addresses.forEach((_, index) => {
if (visionsWithCustomTilemap[index]) {
const offset1 = 0x3C + ((index * 12) - (index * 8))
const offset2 = 0x40 + ((index * 12) - (index * 8))
romBuffer.set(
[offset1, 0x40, 0x1F, 0xE5],
0x367654 + (index * 12)
) // ldr r4,[r15,#-offset1]
romBuffer.set(
[0x04, 0x00, 0x50, 0xE1],
0x367658 + (index * 12)
) // cmp r0, r4
romBuffer.set(
[offset2, 0x00, 0x1F, 0x05],
0x36765C + (index * 12)
) // ldreq r0,[r15,#-offset2]
}
})
romBuffer.set(
[0x01, 0x40, 0xA0, 0xE1],
0x367660 + (addresses.length * 12)
) // mov r4, r1
romBuffer.set(
[0x1E, 0xFF, 0x2F, 0xE1],
0x367664 + (addresses.length * 12)
) // bx r14
}

As you may have realized, writing assembly using JS isn’t exactly elegant… But it works! Now we don’t have the bug anymore, because we are saving and loading the custom tilemap in another memory region, where we have enough space to not overwrite.

If you are curious, here is the PR that implements this fix (it’s very short, just 75 new lines).

FANTASTIC!!
Now our tool is completely complete!!!

With it, we finish our very long journey about reverse engineering a Gameboy Advance game! Wow, it was 8 extensive chapters. But keep calm, because we’ll still have the 9th chapter to conclude everything!
After all, we developed a new product, but what feedback did we get from the users? And what have we, as developers, learned with this path? How can you keep focused for one year to develop a personal project just 4fun like this one?
All these analyses will be in the final chapter. Let’s go!

Next and last chapter: And so? What did we get?

--

--