Windows on ARM - An assembly language primer
The ARM CPU has garnered significant attention in the recent past due to its widespread usage in mobile devices. With Windows 8, Microsoft has for the first time released a mainstream Windows OS that runs on the ARM CPU, although Windows CE has been running on ARM for more than a decade now. Developers and support engineers working with the Windows on ARM (WoA) platform need a basic understanding of the ARM CPU and ARM assembler in order to effectively troubleshoot and debug issues that occur at the lowest levels of the operating system. Although there is no shortage of information on the ARM CPU architecture and assembly language, there is very little information on the usage of ARM assembly on Windows 8. This article attempts to provide the reader with enough information to gain a basic understanding of the ARM assembly language as used by Windows. It does not attempt to be a comprehensive reference manual for the ARM CPU. Please refer to the references section for detailed information on this topic.
Tools
This section covers some of the tools that were used to research this article.
In order to test the conversion of C/C++ constructs to ARM assembler, the ARM cross compiler that ships with VS2013 was used. To build the ARM executables the compiler was run from a console window as shown below. The following section assumes VS2013 is installed on the system in the default install path.
C:\> cd C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\x86_arm
C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\x86_arm> vcvarsx86_arm.bat
C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\x86_arm> cd C:\work
C:\work> cl /FAcs /Zi /D _ARM_WINAPI_PARTITION_DESKTOP_SDK_AVAILABLE=1 HelloWorld.c
HelloWorld.c contains the C source code, as shown below:
#include <stdio.h>

int main(void)
{
    printf("Hello World\n");
    return 0;
}
To study how the individual ARM assembler instructions are translated into ARM opcodes the ARM assembler was used. Once the assembler generated the object (.OBJ) file, the linker (link.exe) was used to examine opcode sequences. All these steps are shown below. The ARM assembler and linker also ship with Visual Studio 2013.
C:\work> armasm HelloASM.asm
C:\work> link -dump -disasm HelloASM.obj
HelloASM.asm contains some arbitrary ARM assembler instructions, as listed below:
    AREA |.text|, CODE, THUMB
|test| PROC
    subs r0,r0,r3
    add r4,r6,r0,lsl #3
    addw r11,sp,#8
    rsbs r5,r1,#0
    bx lr
    b 0
    bl |test|
    ENDP
    END
As shown below, the output of the linker contains the opcodes from the .text section of the .OBJ file.
Microsoft (R) COFF/PE Dumper Version 12.00.21005.1
Copyright (C) Microsoft Corporation. All rights reserved.

Dump of file HelloASM.obj

File Type: COFF OBJECT

test:
00000000: 1AC0 subs r0,r0,r3
00000002: EB06 04C0 add r4,r6,r0,lsl #3
00000006: F20D 0B08 addw r11,sp,#8
0000000A: 424D rsbs r5,r1,#0
0000000C: 4770 bx lr
0000000E: E7FE b 0000000E
00000010: F000 F800 bl test

Summary
84 .debug$S
14 .text
CPU Version
The research for this article was performed on a Microsoft Surface RT (Generation 1) running on an Nvidia TEGRA 3 Quad Core CPU. The Secure Boot Signing Policy that retail devices like Surface RT ship with does not allow live kernel debugging. It is however possible to configure Surface RT devices to generate complete kernel memory dumps and these memory dumps can be loaded and analyzed on both the X86 and X64 versions of WinDBG. So all the research for this article was done using kernel mode and user mode memory dumps generated on the Surface RT device.
User mode dumps on the Surface RT device were generated by simply using Task Manager's "Create dump file" option.
To generate a complete kernel memory dump on a Surface RT system the commands listed below were run from an administrative command prompt, followed by a system reboot and finally bug-checking the system using the RightCtrl+ScrollLock+ScrollLock key sequence, as described in [8].
wmic recoveros set DebugInfoType = 1
reg add "HKLM\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters" /v CrashOnCtrlScroll /t REG_DWORD /d 0x1
Loading a complete kernel memory dump generated on the Surface RT into WinDBG v6.3.9600 for X86/X64 shows that the Windows 8 kernel was running on a quad-core ARM CPU in Thumb-2 mode.
Loading Dump File [MEMORY.DMP]
Kernel Bitmap Dump File: Full address space is available

************* Symbol Path validation summary **************
Response Time (ms) Location
Deferred SRV*c:\SYMBOLS*http://msdl.microsoft.com/download/symbols
Symbol search path is: SRV*c:\SYMBOLS*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows 8 Kernel Version 9200 MP (4 procs) Free ARM (NT) Thumb-2
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 9200.16420.armfre.win8_gdr.120919-1813
Machine Name:
Kernel base = 0x8340c000 PsLoadedModuleList = 0x835d08c0
The "!sysinfo cpuinfo" command identifies the CPU as ARM Family 7 Model C09 Revision 209, i.e. a Cortex-A9 r2p9. Based on this information the ARM Cortex-A9 Technical Reference Manual [3] and the ARM Architecture Reference Manual for ARMv7-A and ARMv7-R [4] were used to research this article.
0: kd> !sysinfo cpuinfo
[CPU Information]
~MHz = REG_DWORD 1300
Component Information = REG_BINARY 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Configuration Data = REG_FULL_RESOURCE_DESCRIPTOR ff,ff,ff,ff,ff,ff,ff,ff,0,0,0,0,0,0,0,0
Identifier = REG_SZ ARM Family 7 Model C09 Revision 209
ProcessorNameString = REG_SZ NVIDIA(R) TEGRA(R) 3 Quad Core CPU
VendorIdentifier = REG_SZ NVIDIA
Registers
The ARM CPU can execute in User, System, Supervisor, Abort, Undefined, Interrupt (IRQ) and Fast Interrupt (FIQ) modes. In total the ARM CPU has 37 physical registers, each one 32 bits wide. Out of these 37 registers, only 17 are visible to software at any given point in time, depending on the mode the CPU is executing in. These registers comprise thirteen general-purpose registers (r0 to r12), three special-purpose registers (r13 to r15) and the Current Program Status Register (CPSR). The special-purpose registers r13, r14 and r15 are also referred to as SP, LR and PC respectively. The CPSR register is similar to the X86/X64 flags register. Unlike the X86, ARM does not contain any segment registers.
The table below lists the ARM CPU registers and their usage.
Register | Description |
---|---|
r0 | Contains the 1st parameter passed to functions. 32-bit function return value, similar to the EAX register on X86. Low-word of 64-bit function return value. |
r1 | Contains the 2nd parameter passed to functions. High-word of 64-bit function return value. |
r2 | Contains the 3rd parameter passed to functions. |
r3 | Contains the 4th parameter passed to functions. |
r4-r10 | General purpose registers, callee saved. |
r11 | Frame Pointer, similar to the EBP register on X86. |
r12 | General purpose register. |
r13 (SP) | Stack Pointer, similar to X86 ESP, callee saved. |
r14 (LR) | Link Register, contains the return address during a function call. Callee saved for non-leaf functions. Leaf functions don't save this register since they don't modify it. |
r15 (PC) | Program Counter, similar to EIP on X86. |
CPSR | Current Program Status Register. Similar to the EFlags register on X86. |
APSR | Application Program Status Register. This is not a separate register but the NZCVQ and GE bits of the CPSR that are writable from user mode. |
SPSR | Saved Program Status Register. Copy of the CPSR at the time an exception occurs. SPSR contains the pre-exception value of the CPSR. The CPU contains a separate instance of the SPSR for every exception mode that is supported by the ARM CPU. |
Of the 17 registers mentioned above, r0-r7 and r15 are unbanked registers, i.e. they map to the same physical registers irrespective of the mode the CPU is executing in. Registers r8 through r14 are banked, i.e. they map to different physical registers depending on the CPU's execution mode. The purpose of banked registers is for the CPU to automatically save and restore these register contents across execution mode changes and to ensure that the registers are not overwritten during an exception. Registers r13 and r14 are banked in all execution modes except System mode, which shares them with User mode. Registers r8 through r12 are banked only in FIQ mode. In addition, the CPSR register is saved into a per-mode SPSR register in all modes except User and System mode. The ARM documentation refers to banked registers with the suffixes svc, abt, und, irq or fiq, representing the execution modes of the CPU in which the registers are used.
The following table shows the banked and unbanked registers in all of the different execution modes of the CPU:
User | System | Supervisor | Abort | Undefined | IRQ | FIQ |
---|---|---|---|---|---|---|
r0 | r0 | r0 | r0 | r0 | r0 | r0 |
r1 | r1 | r1 | r1 | r1 | r1 | r1 |
r2 | r2 | r2 | r2 | r2 | r2 | r2 |
r3 | r3 | r3 | r3 | r3 | r3 | r3 |
r4 | r4 | r4 | r4 | r4 | r4 | r4 |
r5 | r5 | r5 | r5 | r5 | r5 | r5 |
r6 | r6 | r6 | r6 | r6 | r6 | r6 |
r7 | r7 | r7 | r7 | r7 | r7 | r7 |
r8 | r8 | r8 | r8 | r8 | r8 | r8_fiq |
r9 | r9 | r9 | r9 | r9 | r9 | r9_fiq |
r10 | r10 | r10 | r10 | r10 | r10 | r10_fiq |
r11 | r11 | r11 | r11 | r11 | r11 | r11_fiq |
r12 | r12 | r12 | r12 | r12 | r12 | r12_fiq |
SP | SP | SP_svc | SP_abt | SP_und | SP_irq | SP_fiq |
LR | LR | LR_svc | LR_abt | LR_und | LR_irq | LR_fiq |
PC | PC | PC | PC | PC | PC | PC |
CPSR | CPSR | SPSR_svc | SPSR_abt | SPSR_und | SPSR_irq | SPSR_fiq |
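The banking rules in the table can be restated as a small lookup. The following is a minimal sketch in C, not code from Windows or from ARM; the names `arm_mode` and `bank_suffix` are made up for illustration, and the mode values are the standard ARMv7 CPSR mode encodings.

```c
/* CPSR mode encodings (standard ARMv7 values). */
enum arm_mode { MODE_USR = 0x10, MODE_FIQ = 0x11, MODE_IRQ = 0x12,
                MODE_SVC = 0x13, MODE_ABT = 0x17, MODE_UND = 0x1B,
                MODE_SYS = 0x1F };

/* Hypothetical helper: returns the bank suffix that register rN (0..15)
 * maps to in the given mode, or "" for the copy shared with User mode. */
const char *bank_suffix(enum arm_mode mode, unsigned reg)
{
    if (reg <= 7 || reg == 15)
        return "";                          /* r0-r7 and PC: never banked */
    if (reg <= 12)
        return mode == MODE_FIQ ? "_fiq" : "";  /* r8-r12: FIQ only */

    /* r13 (SP) and r14 (LR): one copy per exception mode */
    switch (mode) {
    case MODE_FIQ: return "_fiq";
    case MODE_IRQ: return "_irq";
    case MODE_SVC: return "_svc";
    case MODE_ABT: return "_abt";
    case MODE_UND: return "_und";
    default:       return "";               /* User and System share one copy */
    }
}
```

For example, `bank_suffix(MODE_SVC, 13)` yields `"_svc"`, matching the SP_svc entry in the Supervisor column.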
The list of ARM registers can be examined using the debugger's register display command:
0: kd> r
r0=00000000 r1=00000001 r2=835c3d3c r3=00000000 r4=000000e2 r5=835df580
r6=00000000 r7=00000002 r8=00000000 r9=00000000 r10=00000000 r11=82b30890
r12=912ca010 sp=82b305e0 lr=00000000 pc=835364e0 psr=60000053 -ZC-- ARM
nt!KeBugCheck2+0xfc:
835364e0 f1150020 adds r0,r5,#0x20
Current Program Status Register (CPSR)
The following figure shows the format of the CPSR register.

The five mode bits M[4:0] contain one of the values listed in the following table, indicating the mode the CPU is currently operating in:
Mode | Value | Description |
---|---|---|
USR | 0x10 | User Mode |
FIQ | 0x11 | Fast Interrupt Mode |
IRQ | 0x12 | Interrupt Mode |
SVC | 0x13 | Supervisor Mode |
ABT | 0x17 | Abort Mode |
UND | 0x1B | Undefined Mode |
SYS | 0x1F | System Mode |
The J & T bits determine if the CPU is in ARM or Thumb mode, where J = Jazelle and T = Thumb.
- J=0 & T=0 ARM Mode
- J=0 & T=1 Thumb Mode
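The mode and state fields can be pulled out of a psr value with a few masks. The sketch below assumes the standard ARMv7 CPSR layout (M in bits 4:0, T in bit 5, J in bit 24); the function names are illustrative and not part of any Windows header.

```c
#include <stdint.h>

/* Field positions in the ARMv7 CPSR. */
#define CPSR_MODE_MASK 0x1Fu        /* M[4:0] */
#define CPSR_T_BIT     (1u << 5)    /* Thumb state */
#define CPSR_J_BIT     (1u << 24)   /* Jazelle state */

/* Returns the 5-bit mode field, e.g. 0x13 for Supervisor. */
unsigned cpsr_mode(uint32_t cpsr)
{
    return cpsr & CPSR_MODE_MASK;
}

/* Returns 1 when J=0 and T=1, i.e. the CPU is executing Thumb code. */
int cpsr_is_thumb(uint32_t cpsr)
{
    return (cpsr & CPSR_T_BIT) != 0 && (cpsr & CPSR_J_BIT) == 0;
}

/* Returns 1 when J=0 and T=0, i.e. the CPU is executing ARM code. */
int cpsr_is_arm(uint32_t cpsr)
{
    return (cpsr & (CPSR_T_BIT | CPSR_J_BIT)) == 0;
}
```

Applied to the value psr=60000053 from the register display earlier, `cpsr_mode` gives 0x13 (Supervisor) and `cpsr_is_arm` is true, which agrees with the "ARM" tag the debugger prints next to the flags.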
The NZCV bits are used by conditional flow control instructions to alter program execution based on the result of compare operations. These bits are set by instructions like cmp and tst, and by any other instruction that has an "S" suffix. The 2-letter acronyms in the Condition column in the following table are used as suffixes to branch instructions. The Flags column shows the value of one or more condition bits that would result in the corresponding branch being taken. Note that the Compare and Branch instructions (CBZ and CBNZ) are also conditional, but they test a register against zero directly and do not use the NZCV bits. The typical example of a flag-based conditional flow control instruction is Conditional Branch (Bxx), where xx is the condition suffix as shown below:
Code | Condition | Flags | Description |
---|---|---|---|
0000 (0) | EQ | Z == 1 | Equal |
0001 (1) | NE | Z == 0 | Not equal |
0010 (2) | CS | C == 1 | Carry set |
0011 (3) | CC | C == 0 | Carry clear |
0100 (4) | MI | N == 1 | Minus, negative |
0101 (5) | PL | N == 0 | Plus, positive or zero |
0110 (6) | VS | V == 1 | Overflow |
0111 (7) | VC | V == 0 | No overflow |
1000 (8) | HI | (C == 1) && (Z == 0) | Unsigned higher |
1001 (9) | LS | (C == 0) || (Z == 1) | Unsigned lower or same |
1010 (a) | GE | N == V | Signed greater than or equal |
1011 (b) | LT | N != V | Signed less than |
1100 (c) | GT | (Z == 0) && (N == V) | Signed greater than |
1101 (d) | LE | (Z == 1) || (N != V) | Signed less than or equal |
1110 (e) | AL | Any | Always (unconditional) |
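The table has a regular structure: the upper three bits of the code select a test and the lowest bit inverts it. A minimal C sketch of that evaluation follows; the function name `condition_passed` is made up for illustration, and the flag positions (N, Z, C, V in bits 31 to 28) are the standard ARMv7 CPSR layout.

```c
#include <stdint.h>

/* Evaluates a 4-bit condition code against the NZCV flags held in
 * bits 31..28 of a CPSR value. Returns 1 if the condition passes. */
int condition_passed(unsigned cond, uint32_t cpsr)
{
    int n = (cpsr >> 31) & 1, z = (cpsr >> 30) & 1;
    int c = (cpsr >> 29) & 1, v = (cpsr >> 28) & 1;
    int result;

    switch ((cond >> 1) & 7) {          /* upper three bits pick the test */
    case 0:  result = z;              break;  /* EQ / NE */
    case 1:  result = c;              break;  /* CS / CC */
    case 2:  result = n;              break;  /* MI / PL */
    case 3:  result = v;              break;  /* VS / VC */
    case 4:  result = c && !z;        break;  /* HI / LS */
    case 5:  result = (n == v);       break;  /* GE / LT */
    case 6:  result = !z && (n == v); break;  /* GT / LE */
    default: return 1;                        /* AL: always */
    }
    /* The lowest bit of the code inverts the test. */
    return (cond & 1) ? !result : result;
}
```

With psr=60000053 (Z and C set), EQ, CS and LS pass while NE, HI and GT fail.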
Trap Frames
The trap frame structure (KTRAP_FRAME) is used by Windows to save and restore register contents during interrupts, system calls and exceptions. Due to the use of banked registers the ARM CPU does not push anything on the stack during an exception, hence the trap frame on the ARM CPU is entirely defined by software. The trap frame structure for ARM CPU, defined in ntddk.h, is as follows:
0: kd> dt nt!_KTRAP_FRAME
+0x000 Arg3 : Uint4B
+0x004 FaultStatus : Uint4B
+0x008 FaultAddress : Uint4B
+0x008 TrapFrame : Uint4B
+0x00c Reserved : Uint4B
+0x010 ExceptionActive : UChar
+0x011 ContextFromKFramesUnwound : UChar
+0x012 DebugRegistersValid : UChar
+0x013 PreviousMode : Char
+0x013 PreviousIrql : UChar
+0x014 VfpState : Ptr32 _KARM_VFP_STATE
+0x018 Bvr : [8] Uint4B
+0x038 Bcr : [8] Uint4B
+0x058 Wvr : [1] Uint4B
+0x05c Wcr : [1] Uint4B
+0x060 R0 : Uint4B
+0x064 R1 : Uint4B
+0x068 R2 : Uint4B
+0x06c R3 : Uint4B
+0x070 R12 : Uint4B
+0x074 Sp : Uint4B
+0x078 Lr : Uint4B
+0x07c R11 : Uint4B
+0x080 Pc : Uint4B
+0x084 Cpsr : Uint4B
The kernel debugger .trap command switches the debugger's register context to the given trap frame and displays the contents of the trap frame as shown below:
0: kd> .trap 9f40bd40
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
r0=00000001 r1=00e8fa70 r2=00000001 r3=00000000 r4=00000000 r5=00000000
r6=00000000 r7=00000000 r8=00000000 r9=00000000 r10=00000000 r11=00e8fa40
r12=00000059 sp=00e8f8d0 lr=754c0c4d pc=7787e496 psr=00000030 ----- Thumb
7787e496 ?? ???
As highlighted by the "NOTE:" in the above output, the trap frame structure does not contain fields for the non-volatile registers, i.e. R4-R10. At the time of an exception the non-volatile registers are saved in another structure called the KEXCEPTION_FRAME. The KEXCEPTION_FRAME structure is not exposed through public symbols but it is defined in ntddk.h. The macros GENERATE_EXCEPTION_FRAME and RESTORE_EXCEPTION_FRAME are defined in the WDK header file kxarm.h. These macros are used at the beginning and end of functions respectively to set up and tear down the KEXCEPTION_FRAME structures on the stack.
In addition to the CPU registers described above, the KTRAP_FRAME also contains a copy of the CPU's Breakpoint Value Registers (Bvr) and the Breakpoint Control Registers (Bcr), which control the configuration and usage of the Bvrs. The KTRAP_FRAME also contains a copy of the CPU's Watchpoint Value Registers (Wvr) and the Watchpoint Control Registers (Wcr), which control the configuration and usage of the Wvrs. All of the breakpoint and watchpoint registers reside in co-processor CP14 (more on co-processors later). The maximum number of breakpoints and watchpoints that are available on a CPU is defined in hardware, and these values are cached in the processor control block (Prcb) embedded in the Kernel Processor Control Region (KPCR) structure. The fields KPCR.Prcb.MaxBreakpoints and KPCR.Prcb.MaxWatchpoints cache the maximum number of breakpoints and watchpoints respectively. The content of these fields is shown below:
0: kd> !pcr
KPCR for Processor 0 at 835df000:
Major 1 Minor 1
Panic Stack 00000000
Dpc Stack 00000000
Irql addresses:
Mask 835df000
Table 835df000
Routine 835df000
0: kd> dt nt!_KPCR 835df000 -y Prcb.Max
+0x580 Prcb :
+0x510 MaxBreakpoints : 6
+0x514 MaxWatchpoints : 1
The trap frame also optionally points to the Vector Floating Point (VFP) registers, which reside in co-processor CP10. These registers are used either as 64-bit "D" floating point registers or as the 128-bit NEON SIMD or "Q" registers. The "D" and "Q" registers are aliased, i.e. they map to the same physical bits in the VFP. The VFP register values can be read (stored to memory) using the VSTM instruction and written (loaded from memory) using the VLDM instruction.
The debugger's default register mask on the ARM i.e. 0x01 causes the 'r' command to display only the integer registers. The other registers described above can be examined by setting the register mask to 0x4f as shown below:
0: kd> rm Register output mask is 1: 1 - Integer state (32-bit) 0: kd> rm 4f 0: kd> rm Register output mask is 4f: 2 - Integer state (64-bit) 4 - Floating-point state 8 - CP14 Debug registers 40 - NEON registers 0: kd> r r0=00000000 r1=00000001 r2=835c3d3c r3=00000000 r4=000000e2 r5=835df580 r6=00000000 r7=00000002 r8=00000000 r9=00000000 r10=00000000 r11=82b30890 r12=912ca010 sp=82b305e0 lr=00000000 pc=835364e0 psr=60000053 -ZC-- ARM d0 = 1.55693191866e-020 d1 = -6.89514063657e+017 d2 = -1.70274371354e+110 d3 = -1.1722284665e-024 d4 = 1.15599243873e-030 d5 = 1.91880731274e-200 d6 = -4.84612895168e-070 d7 = -1.05193623686e-280 d8 = 0 d9 = 0 d10= 0 d11= 0 d12= 0 d13= 0 d14= 0 d15= 0 d16= -1.35814759421e+230 d17= -2.53691079801e+193 d18= -2.53693113536e+193 d19= -1.79419100154e+152 d20= 1.34188733727e-181 d21= 2.50542679408e-076 d22= 2.89674605141e-025 d23= -4.60295648278e+263 d24= -9.19007555787e+178 d25= -1.05193623686e-280 d26= -4.84612895168e-070 d27= 1.91880731274e-200 d28= 2.37038032292e-049 d29= -3.79910152647e-130 d30= 4.1280444982e+244 d31= -2.27199129623e-207 fpscr=00000000 q00=-326.276 -16374.9 0.00642032 -0.000336106 q01=-0.00188197 1.36997e-020 -1.15518e+014 1.3477e-020 q02=1.99929e-025 7.3251e-016 0.000349896 -5.21864e-037 q03=-1.9424e-035 4.91647e-020 -4.03837e-009 1.39308e-021 q04=0 0 0 0 q05=0 0 0 0 q06=0 0 0 0 q07=0 0 0 0 q08=-2.83799e+024 0 -1.12897e+029 7.91201e-038 q09=-2.00905e+019 -0.0198441 -2.83799e+024 -2.13538e-010 q10=6.87617e-010 -0.0600408 4.66981e-023 3.73876e+036 q11=-1.75682e+033 26.0458 0.00163584 -1.13635 q12=-1.9424e-035 4.91647e-020 -4.44726e+022 4.93765e-007 q13=1.99929e-025 7.3251e-016 -4.03837e-009 1.39308e-021 q14=-1.25641e-016 -6.14369e-017 1.5957e-006 -9.31384e+024 q15=-2.7332e-026 -2.64336e+011 7.29623e+030 -3.44372e-010 kbvr[0] =00000000 kbcr[0] =00000000 kbvr[1] =00000000 kbcr[1] =00000000 kwvr[0] =00000000 kwcr[0] =00000000 nt!KeBugCheck2+0xfc: 835364e0 f1150020 adds r0,r5,#0x20
Instruction Set
Windows runs the ARM CPU in Thumb-2 mode, in which instructions are either 16 bits (Thumb) or 32 bits (ARM) wide. Thumb mode, which was introduced in early ARM processors, allows for higher instruction density and uniform instruction encoding, but these instructions are limited in functionality compared to their 32-bit ARM counterparts. Here are some of the limitations:
- 16-bit Thumb instructions only contain 3 bits to identify source and destination registers. Consequently only registers R0 - R7 can be accessed by them. The 32-bit ARM instructions, on the other hand, can access the full set of R0 - R15 registers.
Following are some examples of 16-bit instructions accessing registers r0 - r7.
Opcode Mnemonic Operand
2304 movs r3,#4
4605 mov r5,r0
2D00 cmp r5,#0
3B01 subs r3,#1
005C lsls r4,r3,#1
- Thumb instructions cannot be predicated, i.e. they cannot be made to operate conditionally using the NZCV bits like the ARM instruction set.
- Immediate Values are restricted to 12 bits, so only numbers from 0 to 4095 can be encoded with the instruction. However using the barrel shifter, described later in this article, the immediate number can be multiplied and added to an existing register value to increase its range.
- A Thumb routine can call both Thumb code and ARM code, but it cannot contain non-Thumb instructions. The same goes for an ARM routine.
Thumb-2, introduced in modern ARM processors, allows these limitations to be worked around by enabling compilers and the processor to generate and understand functions which combine both Thumb and ARM instructions in the same instruction stream, without requiring branch instructions to switch from one mode to the other.
During a branch operation the ARM CPU must be told that the target of the branch is a Thumb-2 instruction. This is indicated by setting the least significant bit of the branch address. As a consequence of this when a function pointer is examined in WinDBG it always points at a one byte offset within the function as illustrated below:
0: kd> x nt!IopErrorLogWorkItem
835d84d0 nt!IopErrorLogWorkItem = <no type information>
0: kd> dt 835d84d0 nt!_WORK_QUEUE_ITEM
+0x000 List : _LIST_ENTRY [ 0x0 - 0x0 ]
+0x008 WorkerRoutine : 0x836c8189 void nt!IopErrorLogThread+0
+0x00c Parameter : (null)
0: kd> u 0x836c8189
nt!IopErrorLogThread+0x1:
836c8188 e92d4ff0 push {r4-r11,lr}
836c818c f20d0b1c addw r11,sp,#0x1C
836c8190 f618fbb8 bl nt!_security_push_cookie (834e0904)
836c8194 f2ad6d58 subw sp,sp,#0x658
836c8198 2300 movs r3,#0
836c819a 930b str r3,[sp,#0x2C]
836c819c 930c str r3,[sp,#0x30]
836c819e f7fffc1b bl nt!IopErrorLogConnectSession (836c79d8)
The function nt!IopErrorLogThread begins at address 0x836c8188; however, the field WORK_QUEUE_ITEM.WorkerRoutine contains the address 0x836c8189, which has the least significant bit set, indicating a Thumb-2 instruction stream.
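When reading function pointers out of a dump by hand, the Thumb bit has to be masked off to get the address of the first instruction. A minimal sketch in C, with illustrative helper names that are not part of any Windows API:

```c
#include <stdint.h>

/* Returns 1 if the pointer value marks its target as Thumb code. */
int is_thumb_target(uint32_t function_pointer)
{
    return (int)(function_pointer & 1u);
}

/* Returns the address of the first instruction of the function. */
uint32_t code_address(uint32_t function_pointer)
{
    return function_pointer & ~1u;
}
```

For the WorkerRoutine value above, `code_address(0x836c8189)` returns 0x836c8188, the address at which the debugger disassembles the function.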
Instruction Encoding
Since instructions in ARM Thumb-2 mode can be either 16 or 32 bits wide, the way an instruction is encoded plays a critical role in determining the actual instruction size. 32-bit instructions are encoded as 2 separate 16-bit half-words. The value of bits[15:11] of the first half-word determines if the instruction is made of a single half-word (16 bits) or a double half-word (32 bits). If the value of bits[15:11] of the first half-word is 11101, 11110 or 11111, the half-word is the first half-word of a 32-bit instruction; otherwise it is a 16-bit instruction.
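The size rule can be written as a one-line check on the first half-word. This is a minimal sketch; the function name `thumb2_insn_size` is made up for illustration.

```c
#include <stdint.h>

/* Returns the size in bytes (2 or 4) of the Thumb-2 instruction whose
 * first half-word is given. */
int thumb2_insn_size(uint16_t first_halfword)
{
    unsigned top5 = first_halfword >> 11;   /* bits[15:11] */

    /* 11101, 11110 and 11111 introduce a 32-bit instruction */
    return (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F) ? 4 : 2;
}
```

Checked against the linker output in the Tools section, 0xEB06 (add with shift) and 0xF20D (addw) report 4 bytes, while 0x4770 (bx lr) and 0xE7FE (b) report 2 bytes.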
Here is an example of an instruction encoded with a 32-bit opcode:
0:000> u 77b485de L1
ntdll!TppWorkerThread+0x92:
77b485de f3bf8f5b dmb ish
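The opcode for "dmb ish", f3bf8f5b, is made of two 16-bit half-words as illustrated below, with the opcode displayed as a single word (32-bit) and as two half-words (16-bit). Because memory is little-endian, displaying the same bytes as a single 32-bit word shows the two half-words in swapped order.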
0:000> dd 77b485de L1
77b485de 8f5bf3bf
0:000> dw /c1 77b485de L2
77b485de f3bf
77b485e0 8f5b
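The difference between the dd and dw views above comes down to how four bytes in memory are grouped. The following C sketch reproduces both views from the raw bytes; the byte array is reconstructed from the dw output assuming little-endian storage, and all names are illustrative.

```c
#include <stdint.h>

/* Bytes of "dmb ish" as they would sit in memory at 77b485de
 * (reconstructed from the half-words f3bf and 8f5b). */
const uint8_t dmb_ish_bytes[4] = { 0xbf, 0xf3, 0x5b, 0x8f };

/* Reads a little-endian 16-bit half-word from memory. */
uint16_t read_halfword(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Opcode as the disassembler prints it: first half-word on the left. */
uint32_t thumb2_opcode(const uint8_t *p)
{
    return ((uint32_t)read_halfword(p) << 16) | read_halfword(p + 2);
}

/* The same four bytes read as one little-endian 32-bit word (dd view). */
uint32_t read_word(const uint8_t *p)
{
    return (uint32_t)read_halfword(p) | ((uint32_t)read_halfword(p + 2) << 16);
}
```

`thumb2_opcode(dmb_ish_bytes)` yields 0xf3bf8f5b as shown by the u command, while `read_word(dmb_ish_bytes)` yields 0x8f5bf3bf as shown by dd.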
The following excerpt from the ARMv7 Architecture Reference Manual Section A8.8.43 shows the encoding of the above mentioned DMB instruction in Thumb-2 mode.

The first (lower) 16 bit part of the opcode (0xf3bf) is represented by the binary number "1111 0011 1011 1111" which matches the first half of the instruction encoding.
The second (higher) 16 bit part of the opcode (0x8f5b) is represented by the binary number "1000 1111 0101 1011" which matches the second half of the instruction encoding. The "option" value is binary 1011, and specifies the ISH option to the DMB instruction as shown below:

Here is an example of an instruction encoded with a 16-bit opcode:
0: kd> u 834daa62 L1
nt!KiIdleLoop+0x3e:
834daa62 bf10 yield
0: kd> .formats bf10
Evaluate expression:
Hex: 0000bf10
Decimal: 48912
Octal: 00000137420
Binary: 00000000 00000000 10111111 00010000
Chars: ....
Time: Thu Jan 01 08:35:12 1970
Float: low 6.85403e-041 high 0
Double: 2.41657e-319
The opcode (0xbf10) is represented by the binary number 1011 1111 0001 0000, which matches the instruction bit encoding shown below.

Instructions on the ARM CPU have different variants depending on the suffix that follows the primary mnemonic. These suffixes can be S, W, or .W and determine how the instruction is encoded, whether the CPSR flags are affected and how some of the operands are interpreted.
- Instructions that have an S suffix change the NZCV bits of the CPSR register based on the result of the operation.
- Instructions that have a .W suffix are always encoded as 32-bit ARM instructions as opposed to 16-bit Thumb instructions.
- Instructions that have a W suffix zero extend their 12-bit immediate value i.e. the 3rd operand. ARM 32-bit instructions that don't have the W suffix treat their 3rd operand as a 12-bit constant value and decode it based on the value of most significant 4 bits of the constant i.e. bits 11-8.
Following are some variants of the ADD instruction with the same operands encoded differently based on the suffix immediately following the instruction mnemonic. The first column is the opcode for the instruction.
F1010200 add r2,r1,#0
F1010200 add.w r2,r1,#0
F2010200 addw r2,r1,#0
1C0A adds r2,r1,#0
F1110200 adds.w r2,r1,#0
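The two ways of interpreting a 12-bit immediate can be compared side by side. The sketch below follows the ThumbExpandImm pseudocode in the ARMv7 Architecture Reference Manual for the non-W forms, and plain zero extension for the W forms; the C function names are illustrative only.

```c
#include <stdint.h>

static uint32_t rotate_right32(uint32_t value, unsigned count)
{
    count &= 31;
    return count ? (value >> count) | (value << (32 - count)) : value;
}

/* W suffix (addw, subw): the 12-bit immediate is zero-extended as is. */
uint32_t plain_imm12(uint32_t imm12)
{
    return imm12 & 0xFFFu;
}

/* No W suffix (add, add.w, adds.w, bics, ...): "modified immediate". */
uint32_t thumb_expand_imm(uint32_t imm12)
{
    uint32_t imm8;

    imm12 &= 0xFFFu;
    imm8 = imm12 & 0xFFu;

    if ((imm12 & 0xC00u) == 0) {            /* bits 11:10 == 00 */
        switch ((imm12 >> 8) & 3u) {        /* bits 9:8 pick a pattern */
        case 0:  return imm8;                        /* 000000XY */
        case 1:  return (imm8 << 16) | imm8;         /* 00XY00XY */
        case 2:  return (imm8 << 24) | (imm8 << 8);  /* XY00XY00 */
        default: return imm8 * 0x01010101u;          /* XYXYXYXY */
        }
    }
    /* Otherwise: the 8-bit value 1:imm12<6:0> rotated right by imm12<11:7> */
    return rotate_right32(0x80u | (imm12 & 0x7Fu), imm12 >> 7);
}
```

As a cross-check, the encoded field 0xD00 expands to 0x2000, while `plain_imm12(0xD00)` would simply be 0xD00. This is why the same 12 bits can stand for very different constants depending on the suffix.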
Barrel Shifter
The ARM instruction set has the capability to combine shift and rotate operations along with arithmetic, logical, compare, load and store operations in a single instruction. This is achieved through the barrel shifter, a hardware logic unit in the CPU shown below:

The barrel shifter implements shift and rotate operations that can be of arithmetic or logical type like:
- Logical Shift Left (LSL)
- Logical Shift Right (LSR)
- Arithmetic Shift Right (ASR)
- Rotate Right (ROR)
- Rotate Right with Extend (RRX)
Examples of instructions that use the barrel shifter:
ea445302 orr r3,r4,r2,lsl #0x14
eb033412 add r4,r3,r2,lsr #0xC
The ORR instruction performs a Logical Shift Left (LSL) of register r2 by 20 positions. The resulting operation becomes r3 = LogicalOR ( r4, LogicalShiftLeft ( r2, 0x14) ).
The ADD instruction performs a Logical Shift Right (LSR) of register r2 by 12 positions. The resulting operation becomes r4 = Add ( r3, LogicalShiftRight ( r2, 0xc) ).
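The shift operations and the two examples above can be modeled in plain C. This is a sketch of the arithmetic only (RRX is omitted because it needs the carry flag), and the function names are illustrative.

```c
#include <stdint.h>

uint32_t lsl(uint32_t x, unsigned n) { return n < 32 ? x << n : 0; }
uint32_t lsr(uint32_t x, unsigned n) { return n < 32 ? x >> n : 0; }

/* Arithmetic shift right: vacated bits are filled with the sign bit. */
uint32_t asr(uint32_t x, unsigned n)
{
    uint32_t sign = (x & 0x80000000u) ? 0xFFFFFFFFu : 0u;

    if (n >= 32)
        return sign;
    return n ? (x >> n) | (sign << (32 - n)) : x;
}

/* Rotate right: bits shifted out at the bottom re-enter at the top. */
uint32_t ror(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* orr r3,r4,r2,lsl #0x14 : returns the new value of r3 */
uint32_t orr_lsl_example(uint32_t r4, uint32_t r2)
{
    return r4 | lsl(r2, 0x14);
}

/* add r4,r3,r2,lsr #0xC : returns the new value of r4 */
uint32_t add_lsr_example(uint32_t r3, uint32_t r2)
{
    return r3 + lsr(r2, 0xC);
}
```

In C each of these takes two operations (a shift and then the OR or ADD), whereas the barrel shifter lets the CPU encode both in a single instruction.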
Instruction Ordering
Modern compilers attempt to optimize program execution by generating instruction sequences which may be different from what was intended by the high level programming language.
Modern CPUs also perform multiple run time optimizations like instruction pipelining, write buffering, instruction and data caching, speculative execution and out of order execution. While these optimizations result in faster program execution, there are cases where they may lead to undesirable results. This is especially true for low level operations performed by the OS like cache operations, TLB flushes, page table updates and device register accesses. Barriers prevent both the compiler and CPU from performing the above mentioned optimizations.
The ARM CPU documentation uses the term barrier to refer to CPU optimization prevention. There are 3 different types of barriers that can be used on the ARM CPU.
Instr. | Barrier Type | Description |
---|---|---|
dmb | Data Memory Barrier | Ensures that all explicit memory accesses before the DMB instruction complete before any explicit memory accesses after the DMB instruction start. The DMB instruction is automatically inserted by the compiler whenever any Interlocked family of functions are used in C or C++. Additionally declaring a global variable as volatile results in the compiler generating DMB instructions provided the file is compiled with the /volatile:ms, instead of the /volatile:iso option. |
dsb | Data Synchronization Barrier | Completes when all instructions before this instruction complete. The DSB instruction can be directly inserted using the macro _DataSynchronizationBarrier() which is defined in winnt.h. |
isb | Instruction Synchronization Barrier | Flushes the pipeline in the CPU, so that all instructions following the ISB are fetched from cache or memory, after the ISB has been completed. The ISB instruction can be directly inserted using the macro _InstructionSynchronizationBarrier() which is defined in winnt.h. |
The scope of these barrier instructions can be restricted to sharing domains as well as to specific memory access types. These can be specified optionally as suffixes to the barrier instructions. If a barrier instruction does not have a suffix, its scope is assumed to be system wide and it applies to both read and write memory accesses.
Sharing Domain | Suffix | Description |
---|---|---|
Non-Shareable | NSH | Per-Core TLBs |
Inner Shareable | ISH | System Memory |
Outer Shareable | OSH | Device Memory |
Full System | SY or ST | System and Device Memory |
Access Type | Suffix | Comments |
---|---|---|
Read and Write | None | For full system read and write access, the sharing domain and access is combined into the suffix SY. |
Write only | ST | For full system write only access, the sharing domain and access is combined into the suffix ST. |
The following annotated code snippet shows the usage of the ISB instruction to perform a pipeline flush before updating the exception handling settings on the ARM CPU and another one after the update to fetch subsequent instructions directly from memory.
0: kd> uf nt!KiInitializeExceptionVectorTable
nt!KiInitializeExceptionVectorTable:
834a1b40 4b07 ldr r3,=nt!KiArmExceptionVectors+0x1 (834dc6a1)
834a1b42 f0330301 bics r3,r3,#1 ; clear the Thumb bit of the address
834a1b46 ee0c3f10 mcr p15,#0,r3,c12,c0 ; Vector Base Address Register(VBAR) = r3
834a1b4a f3bf8f6f isb
834a1b4e ee113f10 mrc p15,#0,r3,c1,c0 ; r3 = System Control Register(SCTLR)
834a1b52 f4335300 bics r3,r3,#0x2000 ; SCTLR.V = 0 ; Use VBAR + Low Offset
834a1b56 ee013f10 mcr p15,#0,r3,c1,c0 ; System Control Register(SCTLR) = r3
834a1b5a f3bf8f6f isb
834a1b5e 4770 bx lr
Interlocked Operations
Unlike the X86 and X64 CPUs, which use the lock prefix before instructions to make them atomic across multiple CPUs, the ARM CPU uses the LDREX and STREX instructions and their variants to implement interlocked operations. The LDREX and STREX instructions are used in pairs, but there can be other intervening instructions between them.
The following code snippet shows the assembly instructions generated by the compiler during a call to the function InterlockedIncrement ( &g_Lock );.
004010cc f3bf8f5b dmb ish
004010d0 4905 ldr r1,=g_Lock (00419124)
004010d2 e8512f00 ldrex r2,[r1]
004010d6 3201 adds r2,#1
004010d8 e8412300 strex r3,r2,[r1]
004010dc 2b00 cmp r3,#0
004010de d1f8 bne 004010d2
004010e0 f3bf8f5b dmb ish
004010e4 4770 bx lr
In the above function, the LDREX and STREX instructions form an atomic read-modify-write pair, with the intervening adds instruction performing the value increment. If the store fails, the bne instruction branches back to the ldrex and the sequence is retried.
The following snippet describes the functionality of the LDREX and STREX instructions.
LDREX r2,[r1] performs the following steps:
R2 = *R1
Place an exclusive lock on address R1

STREX r3,r2,[r1] performs the following steps:
if ( exclusive lock is held )
    *R1 = R2
    R3 = 0
else // no exclusive lock
    R3 = 1
In the STREX example above the R3 register contains success (0) or failure (1) depending on whether R2 was stored in memory pointed to by R1.
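The retry loop formed by ldrex, strex, cmp and bne has the same shape as a compare-and-swap loop in portable C11. The sketch below is an analogy under that assumption, not the code the Visual Studio compiler generates; `interlocked_increment` and `g_example_counter` are hypothetical names.

```c
#include <stdatomic.h>

/* Hypothetical counter standing in for g_Lock. */
_Atomic long g_example_counter = 0;

/* Atomically increments *addend and returns the new value. */
long interlocked_increment(_Atomic long *addend)
{
    /* Read the current value (the role played by LDREX). */
    long observed = atomic_load(addend);

    /* Try to store observed + 1 (the role played by STREX). If another
     * CPU changed the value in the meantime the exchange fails, observed
     * is refreshed with the current value and the loop retries, just as
     * the bne branches back to the ldrex. */
    while (!atomic_compare_exchange_weak(addend, &observed, observed + 1)) {
        /* retry with the refreshed value */
    }
    return observed + 1;
}
```

Calling `interlocked_increment(&g_example_counter)` twice returns 1 and then 2. The default sequentially consistent ordering of these C11 operations plays roughly the role of the dmb ish instructions that bracket the assembly sequence.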
Commonly Used Instructions
This section lists the most common instructions that are encountered in functions on the WoA platform. Familiarity with these instructions helps in reading and understanding most of the assembler code generated by the Visual Studio compiler targeting WoA. Instruction opcodes are included to clearly distinguish between 16 and 32 bit Thumb-2 instructions.
Arithmetic Instructions
Opcode | Instruction | Operation |
---|---|---|
1ac0 | subs r0,r0,r3 | Subtract. r0 = r0 - r3 |
eb0604c0 | add r4,r6,r0,lsl #3 | Add with Shift. r4 = r6 + LeftShift (r0, 3) |
f20D0b08 | addw r11,sp,#8 | Add Wide. r11 = sp + 0x8. The w suffix selects the 32-bit encoding with a zero-extended 12-bit immediate. |
424d | rsbs r5,r1,#0 | Reverse Subtract. r5 = 0 - r1. |
b21b | sxth r3,r3 | Signed Extend Half-word. r3 = SignExtend16To32Bit(r3). Similar to X86 MOVSX instruction. |
f2c00a61 | movt r10,#0x61 | Move to Top Half. r10[31:16] = 0x61. |
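A few of the operations in this table are easier to follow when written as C expressions. The following is a minimal sketch of the arithmetic only, ignoring the flag updates performed by the S variants; the function names are illustrative.

```c
#include <stdint.h>

/* add r4,r6,r0,lsl #3 : r4 = r6 + (r0 << 3) */
uint32_t add_lsl3(uint32_t r6, uint32_t r0)
{
    return r6 + (r0 << 3);
}

/* rsbs r5,r1,#0 : r5 = 0 - r1 (two's complement negate) */
uint32_t rsb_zero(uint32_t r1)
{
    return 0u - r1;
}

/* sxth r3,r3 : sign-extend the low 16 bits to 32 bits */
uint32_t sxth(uint32_t r)
{
    uint32_t low = r & 0xFFFFu;
    return (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
}

/* movt r10,#imm16 : replace the top half, keep the bottom half */
uint32_t movt(uint32_t r, uint16_t imm16)
{
    return (r & 0x0000FFFFu) | ((uint32_t)imm16 << 16);
}
```

For example, `movt(0x00001234, 0x61)` gives 0x00611234: only bits 31:16 change, which is why movt is typically paired with an earlier move that sets the low half.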
Logical Instructions
Opcode | Instruction | Operation |
---|---|---|
f0530302 | orrs r3,r3,#2 | Bitwise OR. r3 = r3 | 0x02 |
ea834271 | eor r2,r3,r1,ror #0x11 | Bitwise XOR. r2 = r3 ^ RotateRight(r1,0x11) |
f06f0200 | mvn r2,#0 | Bitwise NOT. r2 = ~(0x0) |
4013 | ands r3,r3,r2 | Bitwise AND. r3 = r3 & r2 |
105b | asrs r3,r3,#1 | Arithmetic Shift Right. r3 = r3 >> 1 |
f033043f | bics r4,r3,#0x3f | Bitwise Bit Clear. r4 = r3 & (~0x3f) |
f36f040b | bfc r4,#0,#0xC | Bit Field Clear, sets the specified bit range to zero. r4[11:0] = 0. |
fa94f3a4 | rbit r3,r4 | Reverse Bits. r3[31:0] = r4[0:31] |
f3c30644 | ubfx r6,r3,#1,#5 | Unsigned Bit Field Extract. r6 = ZeroExtend(r3[5:1]). Extract Bits 1 through 5 from r3, zero extend the result and store in r6. |
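The bit-field instructions have no direct C operator, so a short model may help when reading them in a disassembly. This is a sketch of their results only, with illustrative function names; it assumes lsb + width does not exceed 32, as the instruction encodings require.

```c
#include <stdint.h>

/* Mask with the low 'width' bits set (width 1..32). */
static uint32_t low_mask(unsigned width)
{
    return width >= 32 ? 0xFFFFFFFFu : ((1u << width) - 1u);
}

/* bics r4,r3,#0x3f : r4 = r3 & ~0x3f */
uint32_t bic(uint32_t rn, uint32_t mask)
{
    return rn & ~mask;
}

/* bfc r4,#lsb,#width : clear 'width' bits starting at bit 'lsb' */
uint32_t bfc(uint32_t rd, unsigned lsb, unsigned width)
{
    return rd & ~(low_mask(width) << lsb);
}

/* ubfx r6,r3,#lsb,#width : extract a field and zero-extend it */
uint32_t ubfx(uint32_t rn, unsigned lsb, unsigned width)
{
    return (rn >> lsb) & low_mask(width);
}

/* rbit r3,r4 : reverse the order of all 32 bits */
uint32_t rbit(uint32_t rm)
{
    uint32_t out = 0;
    for (int i = 0; i < 32; i++) {
        out = (out << 1) | (rm & 1u);
        rm >>= 1;
    }
    return out;
}
```

For instance, `ubfx(0x7E, 1, 5)` returns 0x1F (bits 5:1 of 0x7E) and `bfc(0xFFFFFFFF, 0, 0xC)` returns 0xFFFFF000.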
Branch
Opcode | Instruction | Operation |
---|---|---|
e002 | b 83456484 | PC Relative Branch. Similar to X86 jmp instruction. |
f7fefc0e | bl 83454c44 | PC Relative Branch and Link. LR = Address of next instruction. Similar to the X86 call instruction. |
4770 | bx lr | Branch to LR. PC=LR. Similar to X86 ret instruction. |
4798 | blx r3 | Branch with Link and Exchange. PC=R3, LR = Address of next instruction. Similar to BL except that BLX can change instruction set from ARM to Thumb, or vice versa. |
d067 | beq 83429a36 | PC Relative Conditional Branch if equal. If (CPSR.Z == 1) PC = BranchTarget. Similar to the X86 JZ instruction. |
f02aaa9b | beq 835236d0 | PC Relative Conditional Branch if equal. If (CPSR.Z == 1) PC = BranchTarget. Similar to the X86 JZ instruction. Since the opcode for this instruction is 32 bits wide, its target range is much larger than that of the previous instruction. |
bb6b | cbnz r3,83429d80 | PC Relative Compare and Branch on Nonzero. if ( R3 != 0 ) PC = BranchTarget. The range of such branches is +4 to +130 bytes. |
Compare and Test
Opcode | Instruction | Operation |
---|---|---|
f0130f10 | tst r3,#0x10 | Set flags based on bitwise AND operation. CPSR.Flags = r3 & 0x10 |
ea930f00 | teq r3,r0 | Set flags based on bitwise XOR operation. CPSR.Flags = r3 ^ r0 |
2800 | cmp r0,#0 | Set flags based on subtraction operation. CPSR.Flags = r0 - ZeroExtend(0x0). The immediate operand is zero extended to make it 32-bits wide. |
f1150f02 | cmn r5,#2 | Set flags based on addition operation. CPSR.Flags = r5 + 0x2 |
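The phrase "CPSR.Flags = ..." in the table above is shorthand for updating the N, Z, C and V condition flags from the result. The C sketch below shows one way to model that for cmp (subtraction) and cmn (addition). The Flags type and function names are illustrative only. For tst and teq only N and Z are modeled here, because the effect on C depends on the shifter operand.

```c
#include <stdint.h>

/* The four condition flags held in CPSR bits 31..28. */
typedef struct { int n, z, c, v; } Flags;

/* Flags produced by cmn a,b : computed from a + b. */
Flags flags_add(uint32_t a, uint32_t b) {
    uint32_t r = a + b;
    Flags f;
    f.n = (int)(r >> 31);                       /* result is negative */
    f.z = (r == 0);                             /* result is zero */
    f.c = (r < a);                              /* unsigned carry out */
    f.v = (int)((~(a ^ b) & (a ^ r)) >> 31);    /* signed overflow */
    return f;
}

/* Flags produced by cmp a,b : computed from a - b. */
Flags flags_sub(uint32_t a, uint32_t b) {
    uint32_t r = a - b;
    Flags f;
    f.n = (int)(r >> 31);
    f.z = (r == 0);
    f.c = (a >= b);                             /* C set means "no borrow" */
    f.v = (int)(((a ^ b) & (a ^ r)) >> 31);
    return f;
}

/* N and Z for tst (pass a & b) or teq (pass a ^ b); C and V are not modeled. */
Flags flags_logical(uint32_t result) {
    Flags f = { (int)(result >> 31), result == 0, 0, 0 };
    return f;
}
```

A beq that follows cmp r0,#0 is taken exactly when flags_sub(r0, 0).z is 1.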
Data Movement
Opcode | Instruction | Operation |
---|---|---|
781B | ldrb r3,[r3] | Load Register Byte. r3 = ZeroExtend(*r3). Similar to the X86 movzx instruction with a byte ptr source operand. |
69bb | ldr r3,[r7,#0x18] | Load Register. r3 = *(r7+0x18) |
534B | strh r3,[r1,r5] | Store Register Halfword. *(r1+r5) = r3[15:0] |
F8858166 | strb r8,[r5,#0x166] | Store Register Byte. *(r5+0x166) = r8[7:0] |
e92d48b8 | push {r3-r5,r7,r11,lr} | Save registers r3,r4, r5, r7, r11, r14 to the stack and decrement SP |
e8bd8800 | pop {r11,pc} | Restore register r11 and r15 from the stack and increment SP |
4642 | mov r2,r8 | r2=r8 |
Special Instructions
Windows uses the ARM CPU's capability of generating exceptions on undefined instructions to process "well known" undefined instructions, which are essentially opcodes that are construed as undefined by ARM but convey meaning to the Windows exception handling mechanism. 16-bit instructions starting with 0xDE are undefined and lead to an Undefined Instruction exception, which is handled by nt!KiUndefinedInstructionException. While the exception is being handled, CPSR.Mode is set to 11011b, i.e. Undefined.
KiUndefinedInstructionException() directly handles certain undefined instructions like __rdpmccntr64, but for the rest, it simply dispatches the exception to KiDispatchException(), which in turn calls KiPreprocessInternalInvalidOpcode(). WoA uses the following undefined instructions:
Opcode | Mnemonic | Description |
---|---|---|
0xDEFE | __debugbreak | Breaks into the debugger. Used by ntdll!DbgUserBreakPoint(). |
0xDEFC | __assertfail | Used to indicate critical assertion failures in the kernel debugger. Used by KeAccumulateTicks() |
0xDEFB | __fastfail | Indicates fast fail conditions resulting in KeBugCheckEx(KERNEL_SECURITY_CHECK_FAILURE). Called by functions like InsertTailList() upon detecting a corrupted list, as described in [9]. |
0xDEFA | __rdpmccntr64 | Reads the 64-bit performance counter co-processor register and returns the value in R0+R1. Used by ReadTimeStampCounter(), KiCacheFlushTrial() etc. |
0xDEFD | __debugservice | Invoke debugger breakpoint. Used by DbgBreakPointWithStatusEnd(), DebugPrompt() etc. |
0xDEF9 | __brkdiv0 | Divide By Zero Exception, used by functions like nt!_rt_udiv. Also generated by the compiler to check the divisor before division operations. |
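When scanning a raw opcode dump, the table above can be applied mechanically. The following C sketch is a simple lookup over exactly the six opcodes listed; the function names are invented for this example and do not correspond to any Windows routine.

```c
#include <stdint.h>
#include <stddef.h>

/* True for any 16-bit opcode of the form 0xDExx, the range the article
   describes as undefined. */
int is_de_undefined(uint16_t opcode) {
    return (opcode & 0xFF00u) == 0xDE00u;
}

/* Returns the mnemonic from the table above, or NULL when the opcode is
   not one of the "well known" undefined instructions listed there. */
const char *woa_undefined_name(uint16_t opcode) {
    switch (opcode) {
    case 0xDEFE: return "__debugbreak";
    case 0xDEFD: return "__debugservice";
    case 0xDEFC: return "__assertfail";
    case 0xDEFB: return "__fastfail";
    case 0xDEFA: return "__rdpmccntr64";
    case 0xDEF9: return "__brkdiv0";
    default:     return NULL;
    }
}
```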
Calling Convention
The ARM CPU and the X64 CPU have very similar calling conventions in that the first four parameters to a function are passed via registers. However, unlike the X64 that has a register spill space, the ARM compiler does not reserve any space on the stack for register based parameters. Another similarity between X64 and ARM is that only the function prolog and epilog modify the value of the stack pointer (SP), the function body never changes SP. The registers used for parameter passing on the ARM CPU are listed below:
- R0 = Parameter #1
- R1 = Parameter #2
- R2 = Parameter #3
- R3 = Parameter #4
- The fifth parameter onwards is stored on the stack.
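The register assignment listed above can be written as a small C helper. This is a simplified sketch that assumes every parameter is a 32-bit integer or pointer; 64-bit and floating point arguments follow additional rules that are not modeled here. ParamLocation and param_location are hypothetical names.

```c
#include <stdint.h>

typedef struct {
    int      in_register;  /* 1 = passed in r0..r3, 0 = passed on the stack */
    int      reg;          /* register number when in_register is 1, else -1 */
    uint32_t sp_offset;    /* byte offset from SP at function entry otherwise */
} ParamLocation;

/* index is zero-based: 0 is Parameter #1. */
ParamLocation param_location(unsigned index) {
    ParamLocation loc;
    if (index < 4) {
        loc.in_register = 1;
        loc.reg = (int)index;
        loc.sp_offset = 0;
    } else {
        loc.in_register = 0;
        loc.reg = -1;
        loc.sp_offset = (index - 4u) * 4u;  /* fifth parameter sits at sp[0] */
    }
    return loc;
}
```

Under these assumptions the seventh parameter of a function is found at sp+8 on entry, since param_location(6).sp_offset evaluates to 8.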
The following figure shows assembler code sequence during a function call.

Function Prolog and Epilog
The following code snippet is an example of instructions that typically make up the prolog of a non-leaf function:
    nt!ExAllocatePoolWithTag:
    835a7000 e92d4ff0 push {r4-r11,lr}
    835a7004 f20d0b1c addw r11,sp,#0x1C
    835a7008 b08f     sub sp,sp,#0x3C

The push instruction above saves the non-volatile registers R4, R5, R6, R7, R8, R9, R10 and R11 along with LR (R14) on the stack. LR (R14) holds the return address that is used to return execution control back to the caller.
The addw instruction sets up the r11 register to point to the location on the stack where the old r11 register was saved. This creates a frame pointer chain similar to the one created on the X86 with the EBP register.
And finally the sub instruction creates space on the stack for local variables.
The corresponding function epilog is shown below:
    nt!ExAllocatePoolWithTag+0x98:
    835a7098 b00f     add sp,sp,#0x3C
    835a709a e8bd8ff0 pop {r4-r11,pc}
The add instruction in the above snippet simply adjusts the stack pointer to skip over the local variables.
The pop instruction restores the contents of the non-volatile registers which were saved in the function prolog.
The value of the saved LR register (i.e. the return address) is loaded directly into the PC, thus returning control back to the caller and obviating the need for an explicit branch instruction.
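The immediate used by the addw instruction in the prolog follows from the register list of the push. The low 16 bits of the 32-bit push opcode form a bit mask in which bit n stands for register rn, and the lowest-numbered register is stored at the lowest address. The sketch below, with hypothetical helper names, computes where a given register ends up relative to SP right after the push.

```c
#include <stdint.h>

/* Total number of bytes written by a push with the given register mask. */
unsigned pushed_bytes(uint16_t mask) {
    unsigned count = 0;
    for (unsigned r = 0; r < 16; r++)
        if (mask & (1u << r))
            count++;
    return count * 4u;
}

/* Byte offset from SP (immediately after the push) to the slot holding
   register `reg`. Assumes `reg` is actually present in the mask. */
unsigned saved_slot_offset(uint16_t mask, unsigned reg) {
    unsigned below = 0;
    for (unsigned r = 0; r < reg; r++)
        if (mask & (1u << r))
            below++;            /* every lower-numbered register sits below */
    return below * 4u;
}
```

For the opcode e92d4ff0 (mask 0x4ff0, i.e. r4-r11 and lr), saved_slot_offset(0x4ff0, 11) yields 0x1C, which is the immediate used by addw r11,sp,#0x1C.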

The prolog and epilog for leaf functions (i.e. functions that don't call others) are very different from the sequence shown above. Following is the complete disassembly of a leaf function:

    nt!IopGetDeviceAttachmentBase:
    83456478 f8d030b0 ldr r3,[r0,#0xB0]
    8345647c e002     b nt!IopGetDeviceAttachmentBase+0xc (83456484)
    8345647e 4618     mov r0,r3
    83456480 f8d330b0 ldr r3,[r3,#0xB0]
    83456484 699b     ldr r3,[r3,#0x18]
    83456486 2b00     cmp r3,#0
    83456488 d1f9     bne nt!IopGetDeviceAttachmentBase+0x6 (8345647e)
    8345648a 4770     bx lr
In the code snippet shown above, the LR register contains the return address of the caller upon entry. Since this function does not modify the LR register contents, returning to the caller simply involves branching to LR i.e. "bx lr".
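The targets of the two PC-relative branches in the listing above (b and bne) can be reproduced by hand from their 16-bit opcodes. The following C sketch decodes both encodings; the function names are illustrative. In Thumb state the PC reads as the address of the branch instruction plus 4, and the encoded immediate is a signed count of halfwords.

```c
#include <stdint.h>

/* Target of a 16-bit unconditional branch (encoding 11100 imm11). */
uint32_t thumb_b_target(uint32_t addr, uint16_t opcode) {
    int32_t imm = opcode & 0x7FF;
    if (imm & 0x400)
        imm -= 0x800;                       /* sign-extend 11 bits */
    return addr + 4u + (uint32_t)(imm * 2); /* offset is in halfwords */
}

/* Target of a 16-bit conditional branch (encoding 1101 cond imm8). */
uint32_t thumb_bcond_target(uint32_t addr, uint16_t opcode) {
    int32_t imm = opcode & 0xFF;
    if (imm & 0x80)
        imm -= 0x100;                       /* sign-extend 8 bits */
    return addr + 4u + (uint32_t)(imm * 2);
}
```

With the values from the listing, thumb_b_target(0x8345647c, 0xe002) gives 0x83456484 and thumb_bcond_target(0x83456488, 0xd1f9) gives 0x8345647e.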
Function Disassembly Walkthrough
To tie together all the concepts introduced above, this section provides a complete annotated listing of the user mode function CreateFileA() in kernelbase.dll.
Here is the prototype of CreateFileA() along with the registers and stack locations that would contain the parameters passed in by the caller.
    HANDLE WINAPI CreateFileA(
        LPCSTR lpFileName,                          ; P1 = r0
        DWORD dwDesiredAccess,                      ; P2 = r1
        DWORD dwShareMode,                          ; P3 = r2
        LPSECURITY_ATTRIBUTES lpSecurityAttributes, ; P4 = r3
        DWORD dwCreationDisposition,                ; P5 = stack sp[0]
        DWORD dwFlagsAndAttributes,                 ; P6 = stack sp[4]
        HANDLE hTemplateFile );                     ; P7 = stack sp[8]
The term "callee" used in the following code snippet refers to the function CreateFileW() which is called by CreateFileA(). Figure 8 depicts the state of the stack after the sub instruction has executed i.e. prolog for CreateFileA() has completed.
    0:000> uf kernelbase!CreateFileA
    KERNELBASE!CreateFileA:
    757a1628 e92d4870 push {r4-r6,r11,lr}  ; save only those non-volatile
                                           ; registers that will be overwritten
    757a162c f20d0b0c addw r11,sp,#0xC     ; point r11 to the location on the stack
                                           ; where caller's r11 (frame pointer) is stored
    757a1630 b087     sub sp,sp,#0x1C      ; create space for local variables (0x10 bytes)
                                           ; and for parameters to callees (0xc bytes)
    757a1632 460e     mov r6,r1            ; r6 = r1 = dwDesiredAccess(caller P2)
    757a1634 4601     mov r1,r0            ; r1 = r0 = lpFileName(caller P1)
    757a1636 a804     add r0,sp,#0x10      ; r0 = sp + 0x10
    757a1638 461c     mov r4,r3            ; r4 = r3 = lpSecurityAttributes(caller P4)
    757a163a 4615     mov r5,r2            ; r5 = r2 = dwShareMode(caller P3)
    757a163c f000fc1a bl KERNELBASE!Basep8BitStringToDynamicUnicodeString (757a1e74)
    757a1640 b1a0     cbz r0,KERNELBASE!CreateFileA+0x44 (757a166c)
                                           ; if the return value from the previous
                                           ; function call (r0) is 0 then
                                           ; goto 757a166c (exit)
    KERNELBASE!CreateFileA+0x1a:
    757a1642 9b0e     ldr r3,[sp,#0x38]    ; r3 = *(sp+0x38) = hTemplateFile(caller P7)
    757a1644 9805     ldr r0,[sp,#0x14]    ; r0 = *(sp+0x14) Local = lpFileName(callee P1)
    757a1646 462a     mov r2,r5            ; r2 = r5 = dwShareMode(callee P3)
    757a1648 9302     str r3,[sp,#8]       ; *(sp+0x8) = r3 = hTemplateFile(callee P7)
    757a164a 9b0d     ldr r3,[sp,#0x34]    ; r3 = *(sp+0x34) = dwFlagsAndAttributes(caller P6)
    757a164c 4631     mov r1,r6            ; r1 = r6 = dwDesiredAccess(callee P2)
    757a164e 9301     str r3,[sp,#4]       ; *(sp+0x4) = r3 = dwFlagsAndAttributes(callee P6)
    757a1650 9b0c     ldr r3,[sp,#0x30]    ; r3 = *(sp+0x30) = dwCreationDisposition(caller P5)
    757a1652 9300     str r3,[sp]          ; *(sp+0x0) = r3 = dwCreationDisposition(callee P5)
    757a1654 4623     mov r3,r4            ; r3 = r4 = lpSecurityAttributes(callee P4)
    757a1656 f000fe61 bl KERNELBASE!CreateFileW (757a231c) ; invoke callee i.e. CreateFileW()
    757a165a 4b06     ldr r3,=KERNELBASE!_imp_RtlFreeUnicodeString (758119c0)
    757a165c 4604     mov r4,r0            ; r4 = r0 = return value from CreateFileW()
    757a165e a804     add r0,sp,#0x10      ; r0 = sp + 0x10 = UnicodeString = P1 to RtlFreeUnicodeString()
    757a1660 681b     ldr r3,[r3]          ; r3 = ntdll!RtlFreeUnicodeString
    757a1662 4798     blx r3               ; call RtlFreeUnicodeString()
    KERNELBASE!CreateFileA+0x3c:
    757a1664 4620     mov r0,r4            ; CreateFileA() return value in r0
    757a1666 b007     add sp,sp,#0x1C      ; free locals and parameter space on stack
    757a1668 e8bd8870 pop {r4-r6,r11,pc}   ; restore all saved non-volatile
                                           ; registers and return to caller
    KERNELBASE!CreateFileA+0x44:
    757a166c f06f0400 mvn r4,#0            ; r4 = ~0x0 = 0xffffffff = -1 (INVALID_HANDLE_VALUE)
    757a1670 e7f8     b KERNELBASE!CreateFileA+0x3c (757a1664)
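The offsets 0x30, 0x34 and 0x38 used in the listing to reach the caller's stack parameters follow from the prolog: five registers are pushed (0x14 bytes) and 0x1C bytes are reserved by the sub instruction. The C sketch below performs that arithmetic. The function name is hypothetical, and the sketch assumes that all parameters are 32 bits wide and that the prolog pushes registers exactly once.

```c
#include <stdint.h>

/* SP-relative offset, after the prolog has completed, of a parameter that
   the caller passed on the stack. param_number is 1-based and must be 5 or
   greater, since parameters 1 through 4 arrive in r0-r3. */
uint32_t incoming_stack_param_offset(unsigned saved_regs,
                                     uint32_t local_bytes,
                                     unsigned param_number) {
    return local_bytes                  /* space created by sub sp,sp,#N */
         + saved_regs * 4u              /* registers saved by push */
         + (param_number - 5u) * 4u;    /* position within the caller's area */
}
```

For CreateFileA(), incoming_stack_param_offset(5, 0x1C, 7) evaluates to 0x38, matching the ldr r3,[sp,#0x38] that fetches hTemplateFile.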

Disassembly Listing
One of the things that will quickly become apparent when examining ARM disassembly in WinDBG is that, more often than not, the debugger's "uf" command will display the following warning.

    0: kd> uf nt!IoCallDriver
    Flow analysis was incomplete, some code may be missing

This section explains why this happens. The ARM compiler generates branches to absolute addresses using instruction sequences similar to the following:

    0: kd> uf nt!IoCallDriver
    . . .
    nt!SMKM_STORE_MGR<SM_TRAITS>::SmpPageEvict+0x3b0:
    835237f0 f6411c87 mov r12,#0x1987
    835237f4 f2c83c55 movt r12,#0x8355
    835237f8 4760     bx r12

WinDBG, as of version 6.3.9600, does not pay attention to mov instructions because they do not fall under the category of flow control instructions. WinDBG encounters the bx r12 instruction and gives up on the static disassembly because it assumes that the value of r12 will be determined at runtime. However, it misses the fact that the above sequence amounts to bx 0x83551987 (the low bit selects Thumb state, so execution continues at 0x83551986), which is nothing but a branch to another function, as shown in the figure below:

So any time WinDBG encounters an indirect branch via a register it fails to follow the function in its entirety.
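A tool that wants to follow such branches statically can recover the constant from the two opcodes. The sketch below decodes the 16-bit immediate that the 32-bit mov (MOVW) and movt encodings scatter across four fields, and combines the pair into the value loaded into r12. The function names are illustrative, and each opcode is passed as a single 32-bit value with the first halfword in the upper 16 bits, the way WinDBG prints it.

```c
#include <stdint.h>

/* Extracts imm16 = imm4:i:imm3:imm8 from a 32-bit Thumb-2 MOVW or MOVT. */
uint32_t thumb2_mov_imm16(uint32_t opcode) {
    uint32_t i    = (opcode >> 26) & 0x1u;
    uint32_t imm4 = (opcode >> 16) & 0xFu;
    uint32_t imm3 = (opcode >> 12) & 0x7u;
    uint32_t imm8 =  opcode        & 0xFFu;
    return (imm4 << 12) | (i << 11) | (imm3 << 8) | imm8;
}

/* Value left in the register after a mov (low half) / movt (high half) pair. */
uint32_t movw_movt_value(uint32_t movw_opcode, uint32_t movt_opcode) {
    return (thumb2_mov_imm16(movt_opcode) << 16) | thumb2_mov_imm16(movw_opcode);
}

/* Address at which execution continues after bx: bit 0 only selects the
   instruction set and is not part of the address. */
uint32_t bx_target(uint32_t reg_value) {
    return reg_value & ~1u;
}
```

For the sequence above, movw_movt_value(0xf6411c87, 0xf2c83c55) yields 0x83551987, and bx_target() of that value is 0x83551986.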
Co-Processor
The ARM CPU has multiple co-processors that implement functionality that is not a part of core instruction execution. The co-processors that are used by Windows, as well as other operating systems, are:
- CP10 (Vector Floating Point Co-processor)
- CP14 (Debug Co-processor)
- CP15 (System control Co-processor)
The MRC and MCR instructions are used to access the co-processor registers. The VFP (CP10) can also be accessed using the VMSR and VMRS instructions.
The compiler intrinsics _MoveFromCoprocessor() and _MoveToCoprocessor() and their variants can be used to access ARM co-processors from C/C++. The Visual Studio 2013 CRT source file "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\crt\src\ARM\helpexcept.c" has examples on how to use these intrinsics.
Since the CP15 co-processor contains the most critical registers required by Windows, some details of this co-processor are included in this section. The CP15 registers are organized by function groups with each group represented by a single primary co-processor register referred to as CRn. The function group descriptions and the corresponding primary control registers are listed in the table below:
CRn | Functionality |
---|---|
c0 | ID and Feature Registers |
c1 | System Control Register |
c2 | Translation Table Base |
c3 | Domain Access Control |
c5 | Fault Status |
c6 | Fault Address Register |
c7 | Cache/Write Buffer Control |
c8 | TLB Maintenance Operations |
c9 | Performance Counters |
c10 | Memory Mapping Registers & TLB Operations |
c11 | DMA Control |
c12 | Security Extensions registers |
c13 | Process, Context & Thread ID Registers |
The following table contains some examples of CP15 registers that are used by Windows for various low level operations. Individual CP15 registers are selected by the primary co-processor register (CRn), the secondary co-processor register (CRm), OpCode #1 (Op1) and OpCode#2 (Op2).
CP# | Opc1 | CRn | CRm | Opc2 | Description |
---|---|---|---|---|---|
p15 | 0 | c1 | c0 | 0 | SCTLR System Control Register (Used by KiInitializeExceptionVectorTable to setup exception handling) |
p15 | 0 | c2 | c0 | 0 | TTBR0 Translation Table Base Register 0 (KxSwapProcess writes the Page Table Base Address to this during context switch, similar to X86 CR3) |
p15 | 0 | c5 | c0 | 0 | DFSR Data Fault Status Register (KiDataAbortException uses this to find the type of data fault that occurred) |
p15 | 0 | c5 | c0 | 1 | IFSR Instruction Fault Status Register (KiPrefetchAbortException uses this to find the type of instruction fetch fault that occurred) |
p15 | 0 | c6 | c0 | 0 | DFAR Data Fault Address Register (KiDataAbortException uses this to find the address at which the fault occurred, similar to X86 CR2) |
p15 | 0 | c6 | c0 | 2 | IFAR Instruction Fault Address Register (KiPrefetchAbortException uses this to find the address at which the fault occurred, similar to X86 CR2) |
p15 | 0 | c9 | c13 | 0 | PMCCNTR Cycle Count Register (Used by the compiler intrinsic __rdpmccntr64) |
p15 | 0 | c12 | c0 | 0 | VBAR Vector Base Address Register (base of exception table, contains nt!KiArmExceptionVectors) |
p15 | 0 | c13 | c0 | 1 | CONTEXTIDR Context ID Register (contains Address Space IDentifier i.e. KPROCESS->Asid) |
p15 | 0 | c13 | c0 | 2 | TPIDRURW Thread ID User Read Write (TEB Thread Environment Block) |
p15 | 0 | c13 | c0 | 3 | TPIDRURO Thread ID User Read Only, Privileged Read Write (31:6 KTHREAD, 3:0 IRQL) |
p15 | 0 | c13 | c0 | 4 | TPIDRPRW Privileged Read Write (KPCR Kernel Processor Control Region) |
A full list of co-processor options is available in [3].
The following code snippet shows the MRC and MCR instructions accessing the contents of the TPIDRURO register in CP15 using primary register c13, secondary register c0, OpCode1=0 and OpCode2=3. The MRC instruction reads the contents of TPIDRURO into ARM register r0. The MCR instruction writes the contents of ARM register r0 to TPIDRURO. Figure 10 labels the various operands passed to the MRC instruction.

    ee1d0f70 mrc p15,#0,r0,c13,c0,#3 ; r0 = TPIDRURO
    ee0d0f70 mcr p15,#0,r0,c13,c0,#3 ; TPIDRURO = r0
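To make the relationship between the operands and the opcode concrete, the following C sketch assembles the 32-bit MRC/MCR opcode from its fields. The function name is invented for this example; the field positions are those of the standard ARM co-processor register transfer encoding.

```c
#include <stdint.h>

/* Builds the opcode for "mrc/mcr p<coproc>,#opc1,r<rt>,c<crn>,c<crm>,#opc2".
   is_read = 1 produces MRC (co-processor to ARM register),
   is_read = 0 produces MCR (ARM register to co-processor). */
uint32_t encode_coproc_move(int is_read, unsigned coproc, unsigned opc1,
                            unsigned rt, unsigned crn, unsigned crm,
                            unsigned opc2) {
    uint32_t op = 0xEE000010u;          /* fixed bits shared by MCR and MRC */
    if (is_read)
        op |= 1u << 20;                 /* L bit: 1 = MRC, 0 = MCR */
    op |= (opc1   & 0x7u) << 21;
    op |= (crn    & 0xFu) << 16;
    op |= (rt     & 0xFu) << 12;
    op |= (coproc & 0xFu) << 8;
    op |= (opc2   & 0x7u) << 5;
    op |= (crm    & 0xFu);
    return op;
}
```

encode_coproc_move(1, 15, 0, 0, 13, 0, 3) reproduces the opcode ee1d0f70 shown above, and passing is_read = 0 with the same fields reproduces ee0d0f70.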

Here are some Windows kernel mode functions that access CP15 registers.
    0: kd> uf nt!PsGetCurrentProcess
    nt!PsGetCurrentProcess:
    83442a18 ee1d3f70 mrc p15,#0,r3,c13,c0,#3 ; R3 = TPIDRURO
    83442a1c f033033f bics r3,r3,#0x3F        ; r3 = r3 & ~0x3f
    83442a20 6f58     ldr r0,[r3,#0x74]       ; r0 = *(r3 + 0x74)
                                              ; r0 = KTHREAD.ApcState.Process
    83442a22 4770     bx lr                   ; return

    0: kd> uf hal!KeGetCurrentIrql
    hal!KeGetCurrentIrql:
    8392c380 ee1d3f70 mrc p15,#0,r3,c13,c0,#3 ; R3 = TPIDRURO
    8392c384 f013000f ands r0,r3,#0xF         ; r0 = r3 & 0x0f
                                              ; R0 = Irql
    8392c388 4770     bx lr                   ; return
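Both functions above apply a mask to the same TPIDRURO value. Expressed in C, using the masks taken from the bics and ands instructions in the listings, the decoding looks roughly as follows. The function names and the sample register value in the usage note are made up for illustration.

```c
#include <stdint.h>

/* KTHREAD address: TPIDRURO with the low six bits cleared (bics r3,r3,#0x3F). */
uint32_t tpidruro_kthread(uint32_t tpidruro) {
    return tpidruro & ~0x3Fu;
}

/* Current IRQL: the low four bits of TPIDRURO (ands r0,r3,#0xF). */
unsigned tpidruro_irql(uint32_t tpidruro) {
    return tpidruro & 0xFu;
}
```

With a hypothetical register value of 0x8e7230c2, the KTHREAD would be at 0x8e7230c0 and the IRQL would be 2.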
Following is a user mode function that accesses CP15. Some of the low level macros like NtGetCurrentTeb() which access CP15 are defined in winnt.h.
    0:000> uf kernel32!GetCurrentThreadId
    kernel32!GetCurrentThreadId:
    77361fd0 ee1d3f50 mrc p15,#0,r3,c13,c0,#2 ; R3 = TPIDRURW
    77361fd4 6a58     ldr r0,[r3,#0x24]       ; TEB.ClientId.UniqueThread
    77361fd6 4770     bx lr                   ; return
System Calls
The SVC instruction causes a Supervisor Call exception. This provides a mechanism for unprivileged software (user mode applications) to make calls into the operating system (kernel routines). WoA uses this mechanism to implement native system calls similar to the int 0x2e, sysenter and syscall instructions on the X86 and X64 CPUs. In the code snippet shown below the NTDLL native API NtClose uses the SVC #1 instruction to invoke the exception handler for system call exceptions (nt!KiSWIException). The service index for NtClose() is 0x0d. The usage of register r12 to pass the service index into the system call is recommended by the ARM Application Binary Interface (ABI).

    0:000> uf ntdll!NtClose
    ntdll!NtClose:
    77b8e230 f04f0c0d mov r12,#0xD ; r12 = system call identifier
    77b8e234 df01     svc #1       ; call into kernel
    77b8e236 4770     bx lr        ; return to caller

WoA uses a system service dispatch table similar to the one on X64. The kernel variable nt!KiServiceTable points to a table of 32-bit entries, each containing a 28-bit relative service offset and a 4-bit argument count. The kernel initialization function nt!KeCompactServiceTable() sets up the table. The expression ServiceAddress = KiServiceTable + (KiServiceTable[ServiceIndex] >> 4) computes the address of the function that implements the native service. The "return from exception" instruction (i.e. RFE sp) transfers execution back to user mode.
The following example shows the address of the function nt!NtClose being computed relative to the base of the table at nt!KiServiceTable using the service index 0x0d.
    0: kd> u nt!KiServiceTable + ( poi(nt!KiServiceTable + (d * 4)) >> 4 )
    nt!NtClose+0x1:
    8364a924 e92d4ff0 push {r4-r11,lr}
    8364a928 f20d0b1c addw r11,sp,#0x1C
    8364a92c f695ffea bl nt!_security_push_cookie (834e0904)
    8364a930 b08a     sub sp,sp,#0x28
    8364a932 4605     mov r5,r0
    8364a934 ee1d3f70 mrc p15,#0,r3,c13,c0,#3
    8364a938 f033033f bics r3,r3,#0x3F
    8364a93c f993815a ldrsb r8,[r3,#0x15A]
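The table lookup performed by the debugger expression above can be sketched in C as shown below. This assumes the entry layout described in the text (upper 28 bits hold the offset relative to the table base, lower 4 bits hold the argument count) and treats the offset as unsigned, as the debugger expression does. The function names and the sample values in the usage note are hypothetical.

```c
#include <stdint.h>

/* Upper 28 bits of an entry: offset of the service routine from the table base. */
uint32_t service_offset(uint32_t entry) {
    return entry >> 4;
}

/* Lower 4 bits of an entry: the argument count field. */
unsigned service_arg_count(uint32_t entry) {
    return entry & 0xFu;
}

/* ServiceAddress = KiServiceTable + (KiServiceTable[ServiceIndex] >> 4) */
uint32_t service_address(uint32_t table_base, uint32_t entry) {
    return table_base + service_offset(entry);
}
```

For a made-up table base of 0x83600000 and a made-up entry value of 0x0004a921, the service routine would be at 0x83604a92 and the argument count field would be 1.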
Exception Handling
On the X86/X64 CPU the Interrupt Descriptor Table (IDT) contains pointers to exception handlers, software interrupt handlers and hardware interrupt handlers. The ARM CPU, by contrast, has a separate exception vector table that contains instruction opcodes instead of function pointers. The opcode for each type of exception in the table is the same (0xf8dff01c); it encodes a PC-relative load into the PC, which fetches the address of the handler for that exception from a literal stored 0x20 bytes after the vector entry and transfers execution control to it. As a part of system startup, the kernel function nt!KiInitializeExceptionVectorTable() writes the address of the Windows exception vector table (nt!KiArmExceptionVectors) to the Vector Base Address Register (VBAR) in CP15. The ARM exception table along with the registered exception handlers is shown below.

    0: kd> u nt!KiArmExceptionVectors
    nt!KiArmExceptionVectors:
    834dc6a0 f8dff01c ldr pc,=0xFFFFFFFF                                      ; [nt!KiArmExceptionVectors+0x20 (834dc6c0)]
    834dc6a4 f8dff01c ldr pc,=nt!KiUndefinedInstructionException+0x1 (834dade1) ; [nt!KiArmExceptionVectors+0x24 (834dc6c4)]
    834dc6a8 f8dff01c ldr pc,=nt!KiSWIException+0x1 (834db941)                ; [nt!KiArmExceptionVectors+0x28 (834dc6c8)]
    834dc6ac f8dff01c ldr pc,=nt!KiPrefetchAbortException+0x1 (834db001)      ; [nt!KiArmExceptionVectors+0x2c (834dc6cc)]
    834dc6b0 f8dff01c ldr pc,=nt!KiDataAbortException+0x1 (834db161)          ; [nt!KiArmExceptionVectors+0x30 (834dc6d0)]
    834dc6b4 f8dff01c ldr pc,=0xFFFFFFFF                                      ; [nt!KiArmExceptionVectors+0x34 (834dc6d4)]
    834dc6b8 f8dff01c ldr pc,=nt!KiInterruptException+0x1 (834db601)          ; [nt!KiArmExceptionVectors+0x38 (834dc6d8)]
    834dc6bc f8dff01c ldr pc,=nt!KiFIQException+0x1 (834db721)                ; [nt!KiArmExceptionVectors+0x3c (834dc6dc)]
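The bracketed addresses in the listing above are the locations of the literals that hold the handler addresses. They can be derived from the opcode 0xf8dff01c, which is a 32-bit Thumb-2 load from a PC-relative literal. The C sketch below shows the computation; the function name is illustrative.

```c
#include <stdint.h>

/* Address of the literal read by a 32-bit Thumb-2 "ldr rt,[pc,#+/-imm12]"
   located at `addr`. The opcode is passed with the first halfword in the
   upper 16 bits, the way WinDBG prints it. */
uint32_t ldr_literal_address(uint32_t addr, uint32_t opcode) {
    uint32_t pc    = (addr + 4u) & ~3u;     /* PC reads as addr + 4, word aligned */
    uint32_t imm12 = opcode & 0xFFFu;
    int      add   = (opcode >> 23) & 1u;   /* U bit: 1 = add, 0 = subtract */
    return add ? pc + imm12 : pc - imm12;
}
```

For the first vector, ldr_literal_address(0x834dc6a0, 0xf8dff01c) gives 0x834dc6c0, i.e. nt!KiArmExceptionVectors+0x20. Since every entry uses the same opcode, each literal sits exactly 0x20 bytes after its vector.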

On the X86 and X64 there is a single exception handler that handles all types of page faults. On the ARM CPU there are two different handlers: one for data page faults (nt!KiDataAbortException) and another for code page faults (nt!KiPrefetchAbortException). Both these exception handlers call the common routine nt!KiCommonMemoryManagementAbort to perform the bulk of page fault handling.
Fast IRQ handling is not supported on the WoA platform. Examining the implementation of the FIQ exception handler (nt!KiFIQException) shows that this function, if ever called, would bug-check the system with the stop code 0x3d (INTERRUPT_EXCEPTION_NOT_HANDLED).

    0: kd> uf KiFIQException
    nt!KiFIQException:
    834db720 e98dc011 srs sp,#0x11
    834db724 e9cd4502 strd r4,r5,[sp,#8]
    834db728 466c     mov r4,sp
    . . .
    nt!KiFIQException+0x10c:
    834db82c 203d     movs r0,#0x3D
    834db82e 2100     movs r1,#0
    834db830 2200     movs r2,#0
    834db832 2300     movs r3,#0
    834db834 468c     mov r12,r1
    834db836 f000fa51 bl nt!KiBugCheckDispatch (834dbcdc)
    834db83a defe     __debugbreak
Interrupt Descriptor Tables
On the X86/X64 CPU, drivers register their interrupt service routines (ISRs) through a system provided template directly in the interrupt descriptor table (IDT). ARM platforms that have a generic interrupt controller (GIC) do not support vectored interrupts. So WoA routes all hardware interrupts through a single entry point (nt!KiInterruptException) which is responsible for determining the source of the interrupt from the GIC and then dispatching the interrupt to the appropriate driver's ISR.
Similar to the X64 CPU, WoA uses a total of 16 IRQLs. The IRQLs associated with hardware devices are in the range 0x8 through 0xb. For each device IRQL, the first 16 device interrupts at that IRQL are registered directly in the KPCR->Idt[] array. Any overflow interrupts, i.e. those beyond the 16 interrupts per device IRQL, are registered in the KPCR->IdtExt[] array. The function KiConnectInterruptInternal() determines whether there is an overflow situation and, if so, allocates the extended IDT at KPCR->IdtExt from NonPagedPool with 0x400 entries. Both the primary IDT (KPCR->Idt[]) and the extended IDT (KPCR->IdtExt[]) contain pointers to the KINTERRUPT structures that were allocated when drivers registered their ISRs.
The following debugger commands show one such KINTERRUPT structure.
    0: kd> !pcr
    KPCR for Processor 0 at 835df000:
        Major 1 Minor 1
        Panic Stack 00000000
        Dpc Stack 00000000
        Irql addresses:
            Mask    835df000
            Table   835df000
            Routine 835df000

    0: kd> dt nt!_KPCR 835df000 -a Idt
       +0x12c Idt :
        . . .
        [128] 0x8fb38a80 Void
        [129] 0x8fb38880 Void
        [130] 0x8e723980 Void
        [131] 0x8e723600 Void
        [132] 0x8e723900 Void
        [133] 0x8e723200 Void
        [134] 0x8e723e00 Void
        [135] 0x8e723e80 Void
        . . .
        [144] 0x8fb38b00 Void
        [145] 0x8fb38900 Void
        [146] 0x8e723a80 Void
        [147] 0x8e723680 Void
        [148] 0x8e723a00 Void
        [149] 0x8e723500 Void
        [150] 0x8e723300 Void
        [151] 0x8e723f00 Void
        . . .

    0: kd> dt nt!_KINTERRUPT 0x8fb38a80
       +0x000 Type                  : 0n22
       +0x002 Size                  : 0n112
       +0x004 InterruptListEntry    : _LIST_ENTRY [ 0x8fb38a84 - 0x8fb38a84 ]
       +0x00c ServiceRoutine        : 0x90026211 unsigned char dxgkrnl!DpiFdoLineInterruptRoutine+0
       +0x010 MessageServiceRoutine : (null)
       +0x014 MessageIndex          : 0
       +0x018 ServiceContext        : 0x87b1d768 Void
       +0x01c SpinLock              : 0
       +0x020 TickCount             : 0
       +0x024 ActualLock            : 0x877cb360 -> 0
       +0x028 DispatchAddress       : (null)
       +0x02c Vector                : 0x800
       +0x030 Irql                  : 0x8 ''
       +0x031 SynchronizeIrql       : 0xb ''
       +0x032 FloatingSave          : 0 ''
       +0x033 Connected             : 0x1 ''
       +0x034 Number                : 0
       +0x038 ShareVector           : 0x1 ''
       +0x03a ActiveCount           : 0
       +0x03c InternalState         : 0n0
       +0x040 Mode                  : 0 ( LevelSensitive )
       +0x044 Polarity              : 0 ( InterruptPolarityUnknown )
       +0x048 ServiceCount          : 0
       +0x04c DispatchCount         : 0
       +0x050 PassiveEvent          : (null)
       +0x054 TrapFrame             : 0x82b30d30 _KTRAP_FRAME
       +0x058 DispatchCode          : [4] 0
       +0x068 DisconnectData        : (null)
       +0x06c ServiceThread         : (null)
Unlike the X86/X64 where the IDT is a hardware defined structure, on the ARM CPU the IDT is software defined. This has an interesting security benefit in that the KINTERRUPT structure on ARM no longer needs to contain any executable code, as can be observed from the size of the KINTERRUPT.DispatchCode[] array in the above output, and hence it can be allocated out of Non-Executable NonPagedPool.
In addition to the primary and extended IDTs described above, WoA also uses a global secondary IDT for General Purpose I/O (GPIO) interrupts. This IDT is allocated from non-paged pool and is pointed to by the global variable nt!KiGlobalSecondaryIDT. Each entry in this table is of type KSECONDARY_IDT_ENTRY which contains an embedded KINTERRUPT structure as shown below. The current implementation allocates the secondary IDT with 0x100 entries.
    0: kd> db nt!KiSecondaryInterruptServicesEnabled L1
    835c0ad7 01

    0: kd> ? poi (nt!KiGlobalSecondaryIDT)
    Evaluate expression: -2037473280 = 868ea000

    0: kd> dt 868ea000 nt!_KSECONDARY_IDT_ENTRY
       +0x000 SpinLock      : 0
       +0x004 ConnectLock   : _KEVENT
       +0x014 LineMasked    : 0 ''
       +0x018 InterruptList : 0x8fb38e80 _KINTERRUPT

    0: kd> dt 0x8fb38e80 nt!_KINTERRUPT
       +0x000 Type                  : 0n22
       +0x002 Size                  : 0n112
       +0x004 InterruptListEntry    : _LIST_ENTRY [ 0x8fb38e84 - 0x8fb38e84 ]
       +0x00c ServiceRoutine        : 0x8c7ab761 unsigned char portcls!CInterruptSync::GetKInterrupt+0
       +0x010 MessageServiceRoutine : (null)
       +0x014 MessageIndex          : 0
       +0x018 ServiceContext        : 0x87d92d68 Void
       +0x01c SpinLock              : 0
       +0x020 TickCount             : 0
       +0x024 ActualLock            : 0x87d92d98 -> 0
       +0x028 DispatchAddress       : (null)
       +0x02c Vector                : 0x1000
       +0x030 Irql                  : 0xb ''
       +0x031 SynchronizeIrql       : 0xb ''
       +0x032 FloatingSave          : 0 ''
       +0x033 Connected             : 0x1 ''
       +0x034 Number                : 0
       +0x038 ShareVector           : 0x1 ''
       +0x03a ActiveCount           : 0
       +0x03c InternalState         : 0n0
       +0x040 Mode                  : 1 ( Latched )
       +0x044 Polarity              : 0 ( InterruptPolarityUnknown )
       +0x048 ServiceCount          : 0
       +0x04c DispatchCount         : 0
       +0x050 PassiveEvent          : (null)
       +0x054 TrapFrame             : (null)
       +0x058 DispatchCode          : [4] 0
       +0x068 DisconnectData        : (null)
       +0x06c ServiceThread         : (null)
As of WinDBG v6.3.9600, the debugger's "!idt" and "!idt -a" commands display all three IDTs mentioned above, but only expand the entries in the secondary IDT, as shown below:

    0: kd> !idt -a
    Dumping IDT: 835df12c
    Dumping Extended IDT: 00000000
    Dumping Secondary IDT: 868ea000
    1000:portcls!CInterruptSync::GetKInterrupt+0x20 (KINTERRUPT 8fb38e80)
    1002:mbtu97w8arm+0x358c (KMDF) (KINTERRUPT 8fb38f00)
    1003:hidi2c!OnInterruptIsr (KMDF) (KINTERRUPT 8e723180)
    1004:SurfaceHomeButton+0x28bc (KMDF) (KINTERRUPT 8e723100)
    1005:SurfaceHomeButton+0x28bc (KMDF) (KINTERRUPT 8e723080)
    1006:SurfaceHomeButton+0x28bc (KMDF) (KINTERRUPT 8e723000)
    1007:SurfaceHomeButton+0x28bc (KMDF) (KINTERRUPT 8fb38f80)
    1008:hidi2c!OnInterruptIsr (KMDF) (KINTERRUPT 8e723280)
    100a:nvthml+0x2ebc (KMDF) (KINTERRUPT 8fb38d00)
    100b:sdbus!SdbusGpioInterrupt (KINTERRUPT 8fb38b80)
Conclusion
This article described the ARM CPU, registers and Thumb-2 instructions. It explained the functionality of the instructions typically seen in code generated by the Visual Studio compiler as well as details of the function calling convention. It covered some unique aspects of the ARM CPU like the barrel shifter, the co-processors, and explicit opcodes used for memory barriers and undefined instructions, while also explaining how such aspects are used by Windows. This article also highlighted some of the key differences between how certain features like trap frames, exception handling, interrupt dispatching, interrupt descriptor tables, system calls and interlocked operations are implemented on ARM as compared to X86/X64.
Special thanks to Alex Ionescu (@aionescu) for his review and valuable feedback on this article.