Windows on ARM - An assembly language primer


The ARM CPU has garnered significant attention in the recent past due to its widespread usage in mobile devices. With Windows 8, Microsoft has for the first time released a mainstream Windows OS that runs on the ARM CPU, although Windows CE has been running on ARM for more than a decade. Developers and support engineers working with the Windows on ARM (WoA) platform need a basic understanding of the ARM CPU and ARM assembler in order to effectively troubleshoot and debug issues that occur at the lowest levels of the operating system. Although there is no shortage of information on the ARM CPU architecture and assembly language, there is very little information on the usage of ARM assembly on Windows 8. This article attempts to provide the reader with enough information to gain a basic understanding of the ARM assembly language as used by Windows. It does not attempt to be a comprehensive reference manual for the ARM CPU; please refer to the references section for detailed information on this topic.

Tools

This section covers some of the tools that were used to research this article.

In order to test the conversion of C/C++ constructs to ARM assembler, the ARM cross compiler that ships with VS2013 was used. To build the ARM executables the compiler was run from a console window as shown below. The following section assumes VS2013 is installed on the system in the default install path.

C:\> cd C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\x86_arm

C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\x86_arm> vcvarsx86_arm.bat

C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\x86_arm> cd C:\work

C:\work> cl /FAcs /Zi /D _ARM_WINAPI_PARTITION_DESKTOP_SDK_AVAILABLE=1  HelloWorld.c

HelloWorld.c contains the C source code, as shown below:

#include "stdio.h"
void main (void )
{
  printf ( "Hello World\n");
}

To study how the individual ARM assembler instructions are translated into ARM opcodes the ARM assembler was used. Once the assembler generated the object (.OBJ) file, the linker (link.exe) was used to examine opcode sequences. All these steps are shown below. The ARM assembler and linker also ship with Visual Studio 2013.

C:\work> armasm HelloASM.asm

C:\work> link -dump -disasm HelloASM.obj

HelloASM.asm contains some arbitrary ARM assembler instructions, as listed below:

  AREA	|.text|, CODE, THUMB
|test|	 PROC
  subs r0,r0,r3
  add r4,r6,r0,lsl #3
  addw r11,sp,#8
  rsbs r5,r1,#0 
  bx lr
  b 0 
  bl |test|
  ENDP
  END

As shown below, the output of the linker contains the opcodes from the .text section of the .OBJ file.

Microsoft (R) COFF/PE Dumper Version 12.00.21005.1
Copyright (C) Microsoft Corporation.  All rights reserved.

Dump of file HelloASM.obj

File Type: COFF OBJECT

test:
  00000000: 1AC0      subs        r0,r0,r3
  00000002: EB06 04C0 add         r4,r6,r0,lsl #3
  00000006: F20D 0B08 addw        r11,sp,#8
  0000000A: 424D      rsbs        r5,r1,#0
  0000000C: 4770      bx          lr
  0000000E: E7FE      b           0000000E
  00000010: F000 F800 bl          test

  Summary
          84 .debug$S
          14 .text

CPU Version

The research for this article was performed on a Microsoft Surface RT (Generation 1) running on an Nvidia TEGRA 3 Quad Core CPU. The Secure Boot Signing Policy that retail devices like Surface RT ship with does not allow live kernel debugging. It is however possible to configure Surface RT devices to generate complete kernel memory dumps and these memory dumps can be loaded and analyzed on both the X86 and X64 versions of WinDBG. So all the research for this article was done using kernel mode and user mode memory dumps generated on the Surface RT device.

User mode dumps on the Surface RT device were generated by simply using Task Manager's "Create dump file" option.

To generate a complete kernel memory dump on a Surface RT system the commands listed below were run from an administrative command prompt, followed by a system reboot and finally bug-checking the system using the RightCtrl+ScrollLock+ScrollLock key sequence, as described in [8].

wmic recoveros set DebugInfoType = 1
reg add "HKLM\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters" /v CrashOnCtrlScroll  /t REG_DWORD /d 0x1

Loading a complete kernel memory dump generated on the Surface RT into WinDBG v6.3.9600 for X86/X64 shows that the Windows 8 kernel was running on a quad-core ARM CPU in Thumb-2 mode.

Loading Dump File [MEMORY.DMP]
Kernel Bitmap Dump File: Full address space is available

************* Symbol Path validation summary **************
Response Time (ms) Location
Deferred           SRV*c:\SYMBOLS*http://msdl.microsoft.com/download/symbols
Symbol search path is: SRV*c:\SYMBOLS*http://msdl.microsoft.com/download/symbols
Executable search path is: 
Windows 8 Kernel Version 9200 MP (4 procs) Free ARM (NT) Thumb-2
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 9200.16420.armfre.win8_gdr.120919-1813
Machine Name:
Kernel base = 0x8340c000 PsLoadedModuleList = 0x835d08c0

The "!sysinfo cpuinfo" command describes the CPU as ARM Family 7 Cortex-A9 r02p09. Based on this information the ARM Cortex-A9 Technical Reference Manual [3] and the ARM Architecture Reference Manual for ARM-v7-A and ARM-v7-R [4] were used to research this article.

0: kd> !sysinfo cpuinfo
[CPU Information]
~MHz = REG_DWORD 1300
Component Information = REG_BINARY 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Configuration Data = REG_FULL_RESOURCE_DESCRIPTOR ff,ff,ff,ff,ff,ff,ff,ff,0,0,0,0,0,0,0,0
Identifier = REG_SZ ARM Family 7 Model C09 Revision 209
ProcessorNameString = REG_SZ NVIDIA(R) TEGRA(R) 3 Quad Core CPU
VendorIdentifier = REG_SZ NVIDIA

Registers

The ARM CPU can execute in User, System, Supervisor, Abort, Undefined, Interrupt (IRQ) and Fast Interrupt (FIQ) modes. In total the ARM CPU has 37 physical registers, each one 32 bits wide. Out of these 37 registers, only 17 registers are visible to software at any given point in time, depending on the mode the CPU is executing in. These registers comprise thirteen general-purpose registers (r0 to r12), three special-purpose registers (r13 to r15) and the Current Program Status Register (CPSR). The special-purpose registers r13, r14 and r15 are also referred to as SP, LR, and PC respectively. The CPSR register is similar to the X86/X64 flags register. Unlike the X86, ARM does not contain any segment registers.

The table below lists the ARM CPU registers and their usage.

Register  Description
r0        Contains the 1st parameter passed to functions. 32-bit function return value, similar to the EAX register on X86. Low-word of 64-bit function return value.
r1        Contains the 2nd parameter passed to functions. High-word of 64-bit function return value.
r2        Contains the 3rd parameter passed to functions.
r3        Contains the 4th parameter passed to functions.
r4-r10    General purpose registers, callee saved.
r11       Frame Pointer, similar to the EBP register on X86.
r12       General purpose register.
r13 (SP)  Stack Pointer, similar to X86 ESP, callee saved.
r14 (LR)  Link Register, contains the return address during a function call. Callee saved for non-leaf functions. Leaf functions don't save this register since they don't modify it.
r15 (PC)  Program Counter, similar to EIP on X86.
CPSR      Current Program Status Register. Similar to the EFlags register on X86.
APSR      Application Program Status Register. This is not a separate register but the NZCVQ and GE bits of the CPSR that are writable from user mode.
SPSR      Saved Program Status Register. Copy of the CPSR at the time an exception occurs. SPSR contains the pre-exception value of the CPSR. The CPU contains a separate instance of the SPSR for every exception mode that is supported by the ARM CPU.

Of the 17 registers mentioned above, r0-r7 and r15 are unbanked registers, i.e. they map to the same physical registers irrespective of the mode the CPU is executing in. Registers r8 through r14 are banked, i.e. they map to different physical registers depending on the CPU's execution mode. The purpose of banked registers is for the CPU to automatically save and restore these register contents across execution mode changes and to ensure that the registers are not overwritten during an exception. Registers r13 and r14 are banked in all execution modes except System mode. Registers r8-r12 are banked only in FIQ mode. In addition, the CPSR register is banked into the SPSR registers in all modes except System mode. The ARM documentation refers to banked registers with the suffixes svc, abt, und, irq or fiq, representing the execution modes of the CPU in which the registers are used.

The following table shows the banked and unbanked registers in all of the different execution modes of the CPU:

User  System  Supervisor  Abort     Undefined  IRQ       FIQ
r0    r0      r0          r0        r0         r0        r0
r1    r1      r1          r1        r1         r1        r1
r2    r2      r2          r2        r2         r2        r2
r3    r3      r3          r3        r3         r3        r3
r4    r4      r4          r4        r4         r4        r4
r5    r5      r5          r5        r5         r5        r5
r6    r6      r6          r6        r6         r6        r6
r7    r7      r7          r7        r7         r7        r7
r8    r8      r8          r8        r8         r8        r8_fiq
r9    r9      r9          r9        r9         r9        r9_fiq
r10   r10     r10         r10       r10        r10       r10_fiq
r11   r11     r11         r11       r11        r11       r11_fiq
r12   r12     r12         r12       r12        r12       r12_fiq
SP    SP      SP_svc      SP_abt    SP_und     SP_irq    SP_fiq
LR    LR      LR_svc      LR_abt    LR_und     LR_irq    LR_fiq
PC    PC      PC          PC        PC         PC        PC
CPSR  CPSR    SPSR_svc    SPSR_abt  SPSR_und   SPSR_irq  SPSR_fiq

The list of ARM registers can be examined using the debugger's register display command:

0: kd> r
r0=00000000 r1=00000001 r2=835c3d3c r3=00000000 r4=000000e2 r5=835df580
r6=00000000 r7=00000002 r8=00000000 r9=00000000 r10=00000000 r11=82b30890
r12=912ca010 sp=82b305e0 lr=00000000 pc=835364e0 psr=60000053 -ZC-- ARM
nt!KeBugCheck2+0xfc:
835364e0 f1150020 adds r0,r5,#0x20

Current Program Status Register (CPSR)

The following figure shows the format of the CPSR register.

FIG#1
Figure 1 : ARM CPSR Register Format

The five mode bits M[4:0] contain the values listed in the following table, indicating the mode the CPU is currently operating in:

Mode  Value  Description
USR   0x10   User Mode
FIQ   0x11   Fast Interrupt Mode
IRQ   0x12   Interrupt Mode
SVC   0x13   Supervisor Mode
ABT   0x17   Abort Mode
UDF   0x1B   Undefined Mode
SYS   0x1F   System Mode

The J & T bits determine if the CPU is in ARM or Thumb mode, where J = Jazelle and T = Thumb.
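The fields described above can be pulled out of a raw psr value with a few shifts and masks. The following is a minimal sketch rather than anything taken from Windows or WinDBG. The function names are made up for illustration, and the bit positions (N, Z, C, V in bits 31 to 28, J in bit 24, T in bit 5, mode in bits 4 to 0) are taken from the ARMv7 Architecture Reference Manual [4].

```c
#include <stdint.h>

/* Hypothetical helpers, named for this example only. */

/* Mode field M[4:0], e.g. 0x13 for Supervisor. */
uint32_t cpsr_mode(uint32_t cpsr) { return cpsr & 0x1Fu; }

/* T bit (bit 5): 1 when the CPU executes Thumb/Thumb-2 instructions. */
uint32_t cpsr_thumb(uint32_t cpsr) { return (cpsr >> 5) & 1u; }

/* J bit (bit 24): Jazelle state. */
uint32_t cpsr_jazelle(uint32_t cpsr) { return (cpsr >> 24) & 1u; }

/* Condition flags as a 4-bit value in the order N Z C V. */
uint32_t cpsr_nzcv(uint32_t cpsr) { return cpsr >> 28; }

/* Short mode name matching the Mode column of the table above. */
const char *cpsr_mode_name(uint32_t cpsr)
{
    switch (cpsr_mode(cpsr)) {
    case 0x10: return "USR";
    case 0x11: return "FIQ";
    case 0x12: return "IRQ";
    case 0x13: return "SVC";
    case 0x17: return "ABT";
    case 0x1B: return "UDF";
    case 0x1F: return "SYS";
    default:   return "unknown";
    }
}
```

Applied to the psr=60000053 value from the register dump shown earlier, these helpers give Supervisor mode, T = 0 and the Z and C flags set, which agrees with the "-ZC-- ARM" annotation printed by the debugger.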

The NZCV bits are used by conditional flow control instructions to alter program execution based on the result of compare operations. These bits are set by instructions like cmp, tst, or any other instruction that has an "S" suffix. The 2-letter acronyms in the Condition column in the following table are used as suffixes to branch instructions. The Flags column shows the value of one or more condition bits that would result in the corresponding branch being taken. An example of such a conditional flow control instruction is the Conditional Branch (Bxx) and its variants, where xx is the condition suffix as shown below. The Compare and Branch instructions (CBZ and CBNZ) are different: they test a register against zero directly and neither read nor modify these flags.

Code      Condition  Flags                 Description
0000 (0)  EQ         Z == 1                Equal
0001 (1)  NE         Z == 0                Not equal
0010 (2)  CS         C == 1                Carry set
0011 (3)  CC         C == 0                Carry clear
0100 (4)  MI         N == 1                Minus, negative
0101 (5)  PL         N == 0                Plus, positive or zero
0110 (6)  VS         V == 1                Overflow
0111 (7)  VC         V == 0                No overflow
1000 (8)  HI         (C == 1) && (Z == 0)  Unsigned higher
1001 (9)  LS         (C == 0) || (Z == 1)  Unsigned lower or same
1010 (a)  GE         N == V                Signed greater than or equal
1011 (b)  LT         N != V                Signed less than
1100 (c)  GT         (Z == 0) && (N == V)  Signed greater than
1101 (d)  LE         (Z == 1) || (N != V)  Signed less than or equal
1110 (e)  AL         Any                   Always (unconditional)
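The way a condition code is checked against the flags can be sketched in C. This is an illustration that follows the condition definitions in the ARMv7 Architecture Reference Manual [4], not code from Windows. The function name arm_condition_passed is hypothetical. It relies on the fact that each odd condition code is the logical inverse of the even code before it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if a branch with the 4-bit condition
 * code 'cond' would be taken given the NZCV flags in 'cpsr'. */
bool arm_condition_passed(uint32_t cond, uint32_t cpsr)
{
    bool n = (cpsr >> 31) & 1u;   /* Negative */
    bool z = (cpsr >> 30) & 1u;   /* Zero */
    bool c = (cpsr >> 29) & 1u;   /* Carry */
    bool v = (cpsr >> 28) & 1u;   /* Overflow */
    bool result;

    switch ((cond >> 1) & 7u) {
    case 0:  result = z;              break;   /* EQ / NE */
    case 1:  result = c;              break;   /* CS / CC */
    case 2:  result = n;              break;   /* MI / PL */
    case 3:  result = v;              break;   /* VS / VC */
    case 4:  result = c && !z;        break;   /* HI / LS */
    case 5:  result = (n == v);       break;   /* GE / LT */
    case 6:  result = !z && (n == v); break;   /* GT / LE */
    default: return true;                      /* AL */
    }

    /* Bit 0 of the code selects the inverted form of the test. */
    return (cond & 1u) ? !result : result;
}
```

For example, with psr=60000053 (Z and C set, N and V clear) a beq or bls would be taken, while a bne or bhi would fall through.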

Trap Frames

The trap frame structure (KTRAP_FRAME) is used by Windows to save and restore register contents during interrupts, system calls and exceptions. Due to the use of banked registers the ARM CPU does not push anything on the stack during an exception, hence the trap frame on the ARM CPU is entirely defined by software. The trap frame structure for ARM CPU, defined in ntddk.h, is as follows:

0: kd> dt nt!_KTRAP_FRAME
   +0x000 Arg3             : Uint4B
   +0x004 FaultStatus      : Uint4B
   +0x008 FaultAddress     : Uint4B
   +0x008 TrapFrame        : Uint4B
   +0x00c Reserved         : Uint4B
   +0x010 ExceptionActive  : UChar
   +0x011 ContextFromKFramesUnwound : UChar
   +0x012 DebugRegistersValid : UChar
   +0x013 PreviousMode     : Char
   +0x013 PreviousIrql     : UChar
   +0x014 VfpState         : Ptr32 _KARM_VFP_STATE
   +0x018 Bvr              : [8] Uint4B
   +0x038 Bcr              : [8] Uint4B
   +0x058 Wvr              : [1] Uint4B
   +0x05c Wcr              : [1] Uint4B
   +0x060 R0               : Uint4B
   +0x064 R1               : Uint4B
   +0x068 R2               : Uint4B
   +0x06c R3               : Uint4B
   +0x070 R12              : Uint4B
   +0x074 Sp               : Uint4B
   +0x078 Lr               : Uint4B
   +0x07c R11              : Uint4B
   +0x080 Pc               : Uint4B
   +0x084 Cpsr             : Uint4B

The kernel debugger .trap command switches the debugger's register context to the given trap frame and displays the contents of the trap frame as shown below:

0: kd> .trap 9f40bd40
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
 r0=00000001  r1=00e8fa70  r2=00000001  r3=00000000  r4=00000000  r5=00000000
 r6=00000000  r7=00000000  r8=00000000  r9=00000000 r10=00000000 r11=00e8fa40
r12=00000059  sp=00e8f8d0  lr=754c0c4d  pc=7787e496 psr=00000030 ----- Thumb
7787e496 ??       ???

As highlighted by the "NOTE:" in the above output, the trap frame structure does not contain fields for the non-volatile registers, i.e. R4-R10. At the time of an exception the non-volatile registers are saved in another structure called the KEXCEPTION_FRAME. The KEXCEPTION_FRAME structure is not exposed through public symbols but it is defined in ntddk.h. The macros GENERATE_EXCEPTION_FRAME and RESTORE_EXCEPTION_FRAME are defined in the WDK header file kxarm.h. These macros are used at the beginning and end of functions respectively to set up and tear down the KEXCEPTION_FRAME structures on the stack.

In addition to the CPU registers described above, the KTRAP_FRAME also contains a copy of the CPU's Breakpoint Value registers (Bvr) and the Breakpoint Control Registers (Bcr) which control the configuration and usage of the Bvrs. The KTRAP_FRAME also contains a copy of the CPU's Watchpoint Value Registers (Wvr) and the Watchpoint Control Registers (Wcr) which control the configuration and usage of the Wvrs. All of the breakpoint and watchpoint registers reside in co-processor CP14, more on co-processors later. The maximum number of breakpoints and watch points that are available on a CPU are defined in hardware and these values are cached in the Kernel Processor Control Region (KPCR) structure. The fields KPCR.MaxBreakpoints and KPCR.MaxWatchpoints cache the maximum number of breakpoints and watchpoints respectively. The content of these fields in the KPCR structure is shown below:

0: kd> !pcr
KPCR for Processor 0 at 835df000:
    Major 1 Minor 1
Panic Stack 00000000
Dpc Stack 00000000
Irql addresses:
    Mask    835df000
    Table   835df000
    Routine 835df000

0: kd> dt nt!_KPCR 835df000 -y Prcb.Max
   +0x580 Prcb     : 
      +0x510 MaxBreakpoints : 6
      +0x514 MaxWatchpoints : 1

The trap frame also optionally points to the Vector Floating Point (VFP) registers, which reside in co-processor CP10. These registers are used as either 64-bit "D" floating point registers or as the NEON 128-bit SIMD or "Q" registers. These "D" and "Q" registers are aliased and they map to the same physical bits in the VFP. The VFP register values can be read using the VSTM instruction and written to using the VLDM instruction.

The debugger's default register mask on the ARM i.e. 0x01 causes the 'r' command to display only the integer registers. The other registers described above can be examined by setting the register mask to 0x4f as shown below:

0: kd> rm
Register output mask is 1:
       1 - Integer state (32-bit)

0: kd> rm 4f

0: kd> rm
Register output mask is 4f:
       2 - Integer state (64-bit)
       4 - Floating-point state
       8 - CP14 Debug registers
      40 - NEON registers

0: kd> r
 r0=00000000  r1=00000001  r2=835c3d3c  r3=00000000  r4=000000e2  r5=835df580
 r6=00000000  r7=00000002  r8=00000000  r9=00000000 r10=00000000 r11=82b30890
r12=912ca010  sp=82b305e0  lr=00000000  pc=835364e0 psr=60000053 -ZC-- ARM

 d0 =    1.55693191866e-020  d1 =   -6.89514063657e+017
 d2 =   -1.70274371354e+110  d3 =    -1.1722284665e-024
 d4 =    1.15599243873e-030  d5 =    1.91880731274e-200
 d6 =   -4.84612895168e-070  d7 =   -1.05193623686e-280
 d8 =                     0  d9 =                     0
 d10=                     0  d11=                     0
 d12=                     0  d13=                     0
 d14=                     0  d15=                     0
 d16=   -1.35814759421e+230  d17=   -2.53691079801e+193
 d18=   -2.53693113536e+193  d19=   -1.79419100154e+152
 d20=    1.34188733727e-181  d21=    2.50542679408e-076
 d22=    2.89674605141e-025  d23=   -4.60295648278e+263
 d24=   -9.19007555787e+178  d25=   -1.05193623686e-280
 d26=   -4.84612895168e-070  d27=    1.91880731274e-200
 d28=    2.37038032292e-049  d29=   -3.79910152647e-130
 d30=     4.1280444982e+244  d31=   -2.27199129623e-207
 fpscr=00000000
q00=-326.276 -16374.9 0.00642032 -0.000336106
q01=-0.00188197 1.36997e-020 -1.15518e+014 1.3477e-020
q02=1.99929e-025 7.3251e-016 0.000349896 -5.21864e-037
q03=-1.9424e-035 4.91647e-020 -4.03837e-009 1.39308e-021
q04=0 0 0 0
q05=0 0 0 0
q06=0 0 0 0
q07=0 0 0 0
q08=-2.83799e+024 0 -1.12897e+029 7.91201e-038
q09=-2.00905e+019 -0.0198441 -2.83799e+024 -2.13538e-010
q10=6.87617e-010 -0.0600408 4.66981e-023 3.73876e+036
q11=-1.75682e+033 26.0458 0.00163584 -1.13635
q12=-1.9424e-035 4.91647e-020 -4.44726e+022 4.93765e-007
q13=1.99929e-025 7.3251e-016 -4.03837e-009 1.39308e-021
q14=-1.25641e-016 -6.14369e-017 1.5957e-006 -9.31384e+024
q15=-2.7332e-026 -2.64336e+011 7.29623e+030 -3.44372e-010

kbvr[0] =00000000  kbcr[0] =00000000
kbvr[1] =00000000  kbcr[1] =00000000
kwvr[0] =00000000  kwcr[0] =00000000
nt!KeBugCheck2+0xfc:
835364e0 f1150020 adds        r0,r5,#0x20

Instruction Set

Windows, like many other modern operating systems, uses the ARM CPU in Thumb-2 mode, in which instructions are either 16 bits (Thumb) or 32 bits (ARM) wide. Thumb mode, which was introduced in early ARM processors, allows for higher instruction density and uniform instruction coding, but these instructions are limited in functionality compared to their 32-bit ARM counterparts.

Thumb-2, introduced in modern ARM processors, allows these limitations to be worked around by enabling compilers and the processor to generate and understand functions which combine both Thumb and ARM instructions in the same instruction stream, without requiring branch instructions to switch from one mode to the other.

During a branch operation the ARM CPU must be told that the target of the branch is a Thumb-2 instruction. This is indicated by setting the least significant bit of the branch address. As a consequence of this when a function pointer is examined in WinDBG it always points at a one byte offset within the function as illustrated below:

0: kd> x nt!IopErrorLogWorkItem
835d84d0          nt!IopErrorLogWorkItem = <no type information>

0: kd> dt 835d84d0 nt!_WORK_QUEUE_ITEM
   +0x000 List             : _LIST_ENTRY [ 0x0 - 0x0 ]
   +0x008 WorkerRoutine    : 0x836c8189     void  nt!IopErrorLogThread+0
   +0x00c Parameter        : (null) 

0: kd> u 0x836c8189
nt!IopErrorLogThread+0x1:
836c8188 e92d4ff0 push        {r4-r11,lr}
836c818c f20d0b1c addw        r11,sp,#0x1C
836c8190 f618fbb8 bl          nt!_security_push_cookie (834e0904)
836c8194 f2ad6d58 subw        sp,sp,#0x658
836c8198 2300     movs        r3,#0
836c819a 930b     str         r3,[sp,#0x2C]
836c819c 930c     str         r3,[sp,#0x30]
836c819e f7fffc1b bl          nt!IopErrorLogConnectSession (836c79d8)

The function nt!IopErrorLogThread begins at address 0x836c8188; however, the field WORK_QUEUE_ITEM.WorkerRoutine contains the address 0x836c8189, which has the least significant bit set, indicating a Thumb-2 instruction stream.
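When post-processing function pointers taken from a dump, the Thumb indicator has to be masked off to obtain the address of the first instruction. A minimal sketch, with hypothetical helper names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for working with ARM code pointers. */

/* True when bit 0 marks the branch target as Thumb/Thumb-2 code. */
bool is_thumb_target(uint32_t code_pointer)
{
    return (code_pointer & 1u) != 0;
}

/* Address of the first instruction, with the Thumb indicator removed. */
uint32_t instruction_address(uint32_t code_pointer)
{
    return code_pointer & ~(uint32_t)1;
}
```

For the WorkerRoutine value 0x836c8189 shown above, is_thumb_target() is true and instruction_address() yields 0x836c8188, the address at which the disassembly of nt!IopErrorLogThread starts.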

Instruction Encoding

Since instruction sizes in ARM Thumb-2 mode can be both 16 and 32 bits, the way an instruction is encoded plays a critical role in determining the actual instruction size. 32-bit instructions are encoded as 2 separate 16-bit half-words. The value of bits[15:11] of the first half-word determines if the instruction is made of a single half-word (16 bits) or a double half-word (32 bits). If the value of bits[15:11] of the first half-word is 11101, 11110 or 11111, the half-word is the first half-word of a 32-bit instruction; otherwise it is a 16-bit instruction.
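The size rule above translates directly into a few lines of C. The sketch below uses hypothetical function names and applies the rule to walk a buffer of half-words, which is roughly what a disassembler has to do before it can decode anything else.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the instruction size in bytes (2 or 4) based on bits[15:11]
 * of the first half-word. Hypothetical helper name. */
unsigned thumb2_instruction_size(uint16_t first_halfword)
{
    unsigned top5 = first_halfword >> 11;

    /* 0b11101, 0b11110 and 0b11111 introduce a 32-bit instruction. */
    if (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F)
        return 4;
    return 2;
}

/* Counts the instructions in a buffer of 'count' half-words. */
size_t thumb2_count_instructions(const uint16_t *code, size_t count)
{
    size_t instructions = 0;
    size_t i = 0;

    while (i < count) {
        i += thumb2_instruction_size(code[i]) / 2;  /* advance 1 or 2 half-words */
        instructions++;
    }
    return instructions;
}
```

Feeding it the ten half-words from the HelloASM.obj listing in the Tools section (1AC0, EB06 04C0, F20D 0B08, 424D, 4770, E7FE, F000 F800) gives seven instructions, matching the seven lines of disassembly.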

Here is an example of an instruction encoded with a 32-bit opcode:

0:000> u 77b485de L1
ntdll!TppWorkerThread+0x92:
77b485de f3bf8f5b dmb         ish

The opcode for “dmb ish” (f3bf8f5b) is made of two 16-bit numbers as illustrated below, with the opcode displayed as a single word (32-bit) and as two half-words (16-bit).

0:000> dd 77b485de  L1
77b485de  8f5bf3bf

0:000> dw /c1 77b485de  L2
77b485de  f3bf
77b485e0  8f5b

The following excerpt from the ARMv7 Architecture Reference Manual Section A8.8.43 shows the encoding of the above mentioned DMB instruction in Thumb-2 mode.

FIG#2
Figure 2 : ARM 32-bit instruction encoding

The first (lower) 16 bit part of the opcode (0xf3bf) is represented by the binary number "1111 0011 1011 1111" which matches the first half of the instruction encoding.

The second (higher) 16 bit part of the opcode (0x8f5b) is represented by the binary number "1000 1111 0101 1011" which matches the second half of the instruction encoding. The “option” value is binary 1011, and specifies the ISH option to the DMB instruction as shown below:

FIG#3
Figure 3 : ISH option of DMB instruction
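The decoding steps above can be sketched as follows. The function names are hypothetical, and the option values are the ones listed for DMB in the ARMv7 Architecture Reference Manual [4]. Note that the two half-words are stored in memory in the order f3bf, 8f5b, which is why a little-endian dword read of the same bytes displays 8f5bf3bf.

```c
#include <stdint.h>

/* Opcode as printed by the disassembler: first half-word in the high bits. */
uint32_t thumb2_opcode(uint16_t hw1, uint16_t hw2)
{
    return ((uint32_t)hw1 << 16) | hw2;
}

/* The same two half-words as seen by a little-endian 32-bit read (dd). */
uint32_t thumb2_dword_in_memory(uint16_t hw1, uint16_t hw2)
{
    return ((uint32_t)hw2 << 16) | hw1;
}

/* DMB is encoded as f3bf 8f5x, where x is the 4-bit option field. */
int is_dmb(uint32_t opcode)
{
    return (opcode & 0xFFFFFFF0u) == 0xF3BF8F50u;
}

/* Name of the barrier option held in the low 4 bits of the opcode. */
const char *barrier_option_name(uint32_t opcode)
{
    switch (opcode & 0xFu) {
    case 0xF: return "sy";
    case 0xE: return "st";
    case 0xB: return "ish";
    case 0xA: return "ishst";
    case 0x7: return "nsh";
    case 0x6: return "nshst";
    case 0x3: return "osh";
    case 0x2: return "oshst";
    default:  return "reserved";
    }
}
```

With hw1 = 0xf3bf and hw2 = 0x8f5b, thumb2_opcode() returns 0xf3bf8f5b, is_dmb() is true and barrier_option_name() returns "ish".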

Here is an example of an instruction encoded with a 16-bit opcode:

0: kd> u 834daa62 L1
nt!KiIdleLoop+0x3e:
834daa62 bf10     yield

0: kd> .formats bf10
Evaluate expression:
  Hex:     0000bf10
  Decimal: 48912
  Octal:   00000137420
  Binary:  00000000 00000000 10111111 00010000
  Chars:   ....
  Time:    Thu Jan 01 08:35:12 1970
  Float:   low 6.85403e-041 high 0
  Double:  2.41657e-319

The opcode (0xbf10) is represented by the binary number 1011 1111 0001 0000, which matches the instruction bit encoding shown below.

FIG#4
Figure 4 : ARM 16-bit instruction encoding

Instructions on the ARM CPU have different variants depending on the suffix that follows the primary mnemonic. These suffixes can be S, W, or .W and determine how the instruction is encoded, whether the CPSR flags are affected and how some of the operands are interpreted.

Following are some variants of the ADD instruction with the same operands encoded differently based on the suffix immediately following the instruction mnemonic. The first column is the opcode for the instruction.

F1010200    add         r2,r1,#0
F1010200    add.w       r2,r1,#0
F2010200    addw        r2,r1,#0
1C0A        adds        r2,r1,#0
F1110200    adds.w      r2,r1,#0

Barrel Shifter

The ARM instruction set has the capability to combine shift and rotate operations along with arithmetic, logical, compare, load and store operations in a single instruction. This is achieved through the barrel shifter, a hardware logic unit in the CPU shown below:

FIG#5
Figure 5 : ARM Barrel Shifter

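The barrel shifter implements shift and rotate operations that can be of arithmetic or logical type, like Logical Shift Left (LSL), Logical Shift Right (LSR), Arithmetic Shift Right (ASR), Rotate Right (ROR) and Rotate Right with Extend (RRX).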

Examples of instructions that use the barrel shifter:

ea445302 orr        r3,r4,r2,lsl #0x14
eb033412 add        r4,r3,r2,lsr #0xC

The ORR instruction performs a Logical Shift Left (LSL) of register r2 by 20 positions. The resulting operation becomes r3 = LogicalOR ( r4, LogicalShiftLeft ( r2, 0x14) ).

The ADD instruction performs a Logical Shift Right (LSR) of register r2 by 12 positions. The resulting operation becomes r4 = Add ( r3, LogicalShiftRight ( r2, 0xc) ).
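The effect of the shifted second operand can be modelled in plain C. The sketch below uses hypothetical function names and restricts shift amounts to 0 through 31. It covers the logical and arithmetic shifts and the rotate. The arithmetic shift is written out by hand because right-shifting a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* Logical Shift Left: vacated low bits are filled with zeros. */
uint32_t lsl(uint32_t x, unsigned n) { return x << n; }

/* Logical Shift Right: vacated high bits are filled with zeros. */
uint32_t lsr(uint32_t x, unsigned n) { return x >> n; }

/* Arithmetic Shift Right: vacated high bits are copies of the sign bit. */
uint32_t asr(uint32_t x, unsigned n)
{
    uint32_t fill = (x & 0x80000000u) ? ~(0xFFFFFFFFu >> n) : 0u;
    return (x >> n) | fill;
}

/* Rotate Right: bits shifted out at the bottom re-enter at the top. */
uint32_t ror(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* orr r3,r4,r2,lsl #0x14 */
uint32_t orr_lsl(uint32_t r4, uint32_t r2) { return r4 | lsl(r2, 0x14); }

/* add r4,r3,r2,lsr #0xC */
uint32_t add_lsr(uint32_t r3, uint32_t r2) { return r3 + lsr(r2, 0xC); }
```

For instance, orr_lsl(0x3, 0x2) produces 0x00200003 and add_lsr(0x1, 0x12345000) produces 0x00012346.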

Instruction Ordering

Modern compilers attempt to optimize program execution by generating instruction sequences which may be different from what was intended by the high level programming language.

Modern CPUs also perform multiple run time optimizations like instruction pipelining, write buffering, instruction and data caching, speculative execution and out of order execution. While these optimizations result in faster program execution, there are cases where they may lead to undesirable results. This is especially true for low level operations performed by the OS like cache operations, TLB flushes, page table updates and device register accesses. Barriers prevent both the compiler and CPU from performing the above mentioned optimizations.

The ARM CPU documentation uses the term barrier to refer to CPU optimization prevention. There are 3 different types of barriers that can be used on the ARM CPU.

Instr.  Barrier Type                         Description
dmb     Data Memory Barrier                  Ensures that all explicit memory accesses before the DMB instruction complete before any explicit memory accesses after the DMB instruction start. The DMB instruction is automatically inserted by the compiler whenever any of the Interlocked family of functions is used in C or C++. Additionally, declaring a global variable as volatile results in the compiler generating DMB instructions provided the file is compiled with the /volatile:ms option instead of the /volatile:iso option.
dsb     Data Synchronization Barrier         Completes when all instructions before this instruction complete. The DSB instruction can be directly inserted using the macro _DataSynchronizationBarrier() which is defined in winnt.h.
isb     Instruction Synchronization Barrier  Flushes the pipeline in the CPU, so that all instructions following the ISB are fetched from cache or memory after the ISB has been completed. The ISB instruction can be directly inserted using the macro _InstructionSynchronizationBarrier() which is defined in winnt.h.

The scope of these barrier instructions can be restricted to sharing domains as well as to specific memory access types. These can be specified optionally as instruction suffixes to the barrier instructions. If a barrier instruction does not have a suffix, its scope is assumed to be system wide and it applies to both read and write type memory accesses.

Sharing Domain   Suffix    Description
Non-Shareable    NSH       Per-Core TLBs
Inner Shareable  ISH       System Memory
Outer Shareable  OSH       Device Memory
Full System      SY or ST  System and Device Memory

Access Type     Suffix  Comments
Read and Write  None    For full system read and write access, the sharing domain and access is combined into the suffix SY.
Write only      ST      For full system write only access, the sharing domain and access is combined into the suffix ST.

The following annotated code snippet shows the usage of the ISB instruction to perform a pipeline flush before updating the exception handling settings on the ARM CPU and another one after the update to fetch subsequent instructions directly from memory.

0: kd> uf nt!KiInitializeExceptionVectorTable
nt!KiInitializeExceptionVectorTable:
834a1b40 4b07     ldr         r3,=nt!KiArmExceptionVectors+0x1 (834dc6a1)
834a1b42 f0330301 bics        r3,r3,#1
834a1b46 ee0c3f10 mcr         p15,#0,r3,c12,c0 ; Vector Base Address Register(VBAR) = r3
834a1b4a f3bf8f6f isb
834a1b4e ee113f10 mrc         p15,#0,r3,c1,c0  ; r3 = System Control Register(SCTLR)
834a1b52 f4335300 bics        r3,r3,#0x2000    ; SCTLR.V = 0 ; Use VBAR + Low Offset 
834a1b56 ee013f10 mcr         p15,#0,r3,c1,c0  ; System Control Register(SCTLR) = r3
834a1b5a f3bf8f6f isb
834a1b5e 4770     bx          lr

Interlocked Operations

Unlike the X86 and X64 CPUs, which use the lock prefix before instructions to make them atomic across multiple CPUs, the ARM CPU uses the LDREX and STREX instructions and their variants to implement interlocked operations. The LDREX and STREX instructions are used in pairs but there can be other intervening instructions between them.

The following code snippet shows the assembly instructions generated by the compiler during a call to the function InterlockedIncrement ( &g_Lock );.

004010cc f3bf8f5b dmb ish
004010d0 4905     ldr r1,=g_Lock (00419124)
004010d2 e8512f00 ldrex r2,[r1]
004010d6 3201     adds r2,#1
004010d8 e8412300 strex r3,r2,[r1]
004010dc 2b00     cmp r3,#0
004010de d1f8     bne 004010d2
004010e0 f3bf8f5b dmb ish
004010e4 4770     bx lr

In the above function, the combination of the instructions LDREX and STREX form an atomic read modify/write pair with the intervening adds instruction performing the value increment.

The following snippet describes the functionality of the LDREX and STREX instructions.

LDREX r2,[r1] performs the following steps:
    R2 = *R1
    Place an exclusive lock on address R1
STREX r3,r2,[r1] performs the following steps:
    if ( exclusivelock is held )
        *R1 = R2
        R3 = 0
    else // no exclusive lock
        R3 = 1

In the STREX example above the R3 register contains success (0) or failure (1) depending on whether R2 was stored in memory pointed to by R1.
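The retry pattern formed by ldrex, strex, cmp and bne has the same shape as a compare-and-swap loop in portable C11. The sketch below is an analogy and not the implementation of InterlockedIncrement. The function name is hypothetical, and whether a given compiler lowers the weak compare-exchange to an LDREX/STREX pair on ARMv7 is an assumption about typical code generation.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for an interlocked increment. Returns the new value. */
uint32_t interlocked_increment_sketch(_Atomic uint32_t *target)
{
    /* Corresponds to the ldrex: read the current value. */
    uint32_t observed = atomic_load_explicit(target, memory_order_relaxed);
    uint32_t desired;

    do {
        /* Corresponds to adds r2,#1. */
        desired = observed + 1u;
        /* Corresponds to strex + cmp + bne: the store only succeeds if the
         * location was not modified since it was read; on failure 'observed'
         * is refreshed with the current value and the loop retries. */
    } while (!atomic_compare_exchange_weak_explicit(
                 target, &observed, desired,
                 memory_order_seq_cst, memory_order_relaxed));

    return desired;
}
```

The weak form of the compare-exchange is allowed to fail spuriously, which mirrors the behaviour of STREX: the store can be rejected even when no other CPU wrote to the location, so the loop is required in both cases.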

Commonly Used Instructions

This section lists the most common instructions that are encountered in functions on the WoA platform. Familiarity with these instructions helps in reading and understanding most of the assembler code generated by the Visual Studio compiler targeting WoA. Instruction opcodes are included to clearly distinguish between 16 and 32 bit Thumb-2 instructions.

Arithmetic Instructions
Opcode    Instruction             Operation
1ac0      subs r0,r0,r3           Subtract. r0 = r0 - r3
eb0604c0  add r4,r6,r0,lsl #3     Add with Shift. r4 = r6 + LeftShift (r0, 3)
f20d0b08  addw r11,sp,#8          Add. r11 = sp + 0x8. The w suffix selects the 32-bit encoding that takes a 12-bit immediate.
424d      rsbs r5,r1,#0           Reverse Subtract. r5 = 0 - r1.
b21b      sxth r3,r3              Signed Extend Half-word. r3 = SignExtend16To32Bit(r3). Similar to X86 MOVSX instruction.
f2c00a61  movt r10,#0x61          Move to Top Half. r10[31:16] = 0x61.
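For readers more comfortable with C than with ARM mnemonics, the less familiar arithmetic instructions above can be approximated as shown below. The function names are hypothetical, and the sketch models only the destination value, not the flag updates performed by the "S" forms.

```c
#include <stdint.h>

/* add rd,rn,rm,lsl #shift : rn plus the shifted second operand. */
uint32_t add_lsl(uint32_t rn, uint32_t rm, unsigned shift)
{
    return rn + (rm << shift);
}

/* rsb rd,rn,#imm : the operands of the subtraction are reversed. */
uint32_t rsb_imm(uint32_t rn, uint32_t imm)
{
    return imm - rn;
}

/* sxth rd,rm : sign extend the low 16 bits of rm to 32 bits. */
uint32_t sxth(uint32_t rm)
{
    uint32_t low = rm & 0xFFFFu;
    return (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
}

/* movt rd,#imm16 : replace the top half of rd, keep the bottom half. */
uint32_t movt(uint32_t rd, uint16_t imm16)
{
    return (rd & 0x0000FFFFu) | ((uint32_t)imm16 << 16);
}
```

rsb_imm(r1, 0) is the two's complement negation that the compiler emits as rsbs r5,r1,#0, and movt(r10, 0x61) leaves the low 16 bits of r10 untouched, which is why movt is normally paired with a preceding 16-bit move to build a full 32-bit constant.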
Logical Instructions
Opcode    Instruction             Operation
f0530302  orrs r3,r3,#2           Bitwise OR. r3 = r3 | 0x02
ea834271  eor r2,r3,r1,ror #0x11  Bitwise XOR. r2 = r3 ^ RotateRight(r1,0x11)
f06f0200  mvn r2,#0               Bitwise NOT. r2 = ~(0x0)
4013      ands r3,r3,r2           Bitwise AND. r3 = r3 & r2
105b      asrs r3,r3,#1           Arithmetic Shift Right. r3 = r3 >> 1
f033043f  bics r4,r3,#0x3f        Bitwise Bit Clear. r4 = r3 & (~0x3f)
f36f040b  bfc r4,#0,#0xC          Bit Field Clear, sets the specified bit range to zero. r4[11:0] = 0.
fa94f3a4  rbit r3,r4              Reverse Bits. r3[31:0] = r4[0:31]
f3c30644  ubfx r6,r3,#1,#5        Unsigned Bit Field Extract. r6 = ZeroExtend(r3[5:1]). Extract Bits 1 through 5 from r3, zero extend the result and store in r6.
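The bit-field instructions have no single-operator equivalent in C, so a short sketch of what they compute may help. The function names are hypothetical and the flag updates of the "S" forms are not modelled.

```c
#include <stdint.h>

/* Mask with the low 'width' bits set (width in 1..32). */
static uint32_t low_bits(unsigned width)
{
    return (width >= 32u) ? 0xFFFFFFFFu : (((uint32_t)1 << width) - 1u);
}

/* bic rd,rn,#mask : clear the bits of rn that are set in mask. */
uint32_t bic(uint32_t rn, uint32_t mask)
{
    return rn & ~mask;
}

/* bfc rd,#lsb,#width : clear 'width' bits of rd starting at bit 'lsb'. */
uint32_t bfc(uint32_t rd, unsigned lsb, unsigned width)
{
    return rd & ~(low_bits(width) << lsb);
}

/* ubfx rd,rn,#lsb,#width : extract a bit field and zero extend it. */
uint32_t ubfx(uint32_t rn, unsigned lsb, unsigned width)
{
    return (rn >> lsb) & low_bits(width);
}

/* rbit rd,rm : reverse the order of all 32 bits. */
uint32_t rbit(uint32_t rm)
{
    uint32_t result = 0;
    for (int i = 0; i < 32; i++) {
        result = (result << 1) | (rm & 1u);  /* move the lowest bit of rm to the top end */
        rm >>= 1;
    }
    return result;
}
```

As an example, ubfx(0xFFFFFFFF, 1, 5) yields 0x1F, and bfc(0xFFFFFFFF, 0, 0xC) yields 0xFFFFF000, matching the r4[11:0] = 0 description in the table.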
Branch
Opcode    Instruction             Operation
e002      b 83456484              PC Relative Branch. Similar to X86 jmp instruction.
f7fefc0e  bl 83454c44             PC Relative Branch and Link. LR = Address of next instruction. Similar to the X86 call instruction.
4770      bx lr                   Branch to LR. PC=LR. Similar to X86 ret instruction.
4798      blx r3                  Branch with Link and Exchange. PC=R3, LR = Address of next instruction. Similar to BL except that BLX can change instruction set from ARM to Thumb, or vice versa.
d067      beq 83429a36            PC Relative Conditional Branch if equal. If (CPSR.Z == 1) PC = BranchTarget. Similar to the X86 JZ instruction.
f02aaa9b  beq 835236d0            PC Relative Conditional Branch if equal. If (CPSR.Z == 1) PC = BranchTarget. Similar to the X86 JZ instruction. Since the opcode for this instruction is 32 bits, its target range is much larger than that of the previous instruction.
bb6b      cbnz r3,83429d80        PC Relative Compare and Branch on Nonzero. if ( R3 != 0 ) PC = BranchTarget. The range of such branches is +4 to +130 bytes.
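The branch targets printed by the disassembler are not stored in the opcode; they are computed from a PC-relative offset, where PC reads as the address of the branch plus 4 in Thumb state. The sketch below decodes the three 16-bit branch forms using the encodings from the ARMv7 Architecture Reference Manual [4]. The function names are hypothetical, and the arithmetic is kept in unsigned 32-bit values so that negative offsets simply wrap around.

```c
#include <stdint.h>

/* Sign extend the low 'bits' bits of 'value' to 32 bits (bits in 1..31). */
static uint32_t sign_extend32(uint32_t value, unsigned bits)
{
    uint32_t sign = (uint32_t)1 << (bits - 1u);
    value &= (sign << 1) - 1u;
    return (value ^ sign) - sign;
}

/* 16-bit unconditional branch: 11100 imm11, offset = SignExtend(imm11:0). */
uint32_t thumb_b_target(uint32_t address, uint16_t opcode)
{
    uint32_t offset = sign_extend32((uint32_t)(opcode & 0x7FFu) << 1, 12);
    return address + 4u + offset;
}

/* 16-bit conditional branch: 1101 cond imm8, offset = SignExtend(imm8:0). */
uint32_t thumb_bcond_target(uint32_t address, uint16_t opcode)
{
    uint32_t offset = sign_extend32((uint32_t)(opcode & 0xFFu) << 1, 9);
    return address + 4u + offset;
}

/* cbz/cbnz: 1011 op 0 i 1 imm5 Rn, offset = ZeroExtend(i:imm5:0). */
uint32_t thumb_cbz_target(uint32_t address, uint16_t opcode)
{
    uint32_t i    = (opcode >> 9) & 1u;
    uint32_t imm5 = (opcode >> 3) & 0x1Fu;
    return address + 4u + ((i << 6) | (imm5 << 1));
}
```

Two listings earlier in this article can be used as a check: the e7fe opcode at offset 0x0E in HelloASM.obj decodes to a branch to 0x0E (a branch to itself), and the d1f8 opcode at 004010de in the InterlockedIncrement listing decodes to 004010d2, the address of the ldrex. Because the cbz/cbnz offset is zero extended and at most 126, these instructions can only branch forward, which is where the +4 to +130 byte range comes from.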
Compare and Test
Opcode    Instruction             Operation
f0130f10  tst r3,#0x10            Set flags based on bitwise AND operation. CPSR.Flags = r3 & 0x10
ea930f00  teq r3,r0               Set flags based on bitwise XOR operation. CPSR.Flags = r3 ^ r0
2800      cmp r0,#0               Set flags based on subtraction operation. CPSR.Flags = r0 - ZeroExtend(0x0). The immediate operand is zero extended to make it 32-bits wide.
f1150f02  cmn r5,#2               Set flags based on addition operation. CPSR.Flags = r5 + 0x2
Data Movement
Opcode    Instruction             Operation
781b      ldrb r3,[r3]            Load Register Byte. r3 = ZeroExtend(*r3). Similar to the X86 mov byte ptr instruction.
69bb      ldr r3,[r7,#0x18]       Load Register. r3 = *(r7+0x18)
534b      strh r3,[r1,r5]         Store Register Halfword. *(r1+r5) = r3[15:0]
f8858166  strb r8,[r5,#0x166]     Store Register Byte. *(r5+0x166) = r8[7:0]
e92d48b8  push {r3-r5,r7,r11,lr}  Save registers r3, r4, r5, r7, r11, r14 to the stack and decrement SP
e8bd8800  pop {r11,pc}            Restore registers r11 and r15 from the stack and increment SP
4642      mov r2,r8               r2=r8
Special Instructions

Windows uses the ARM CPU's capability of generating exceptions on undefined instructions to process “well known” undefined instructions, which are opcodes that the ARM architecture treats as undefined but that convey meaning to the Windows exception handling mechanism. 16-bit instructions starting with 0xDE are undefined and lead to an Undefined Instruction exception, which is handled by nt!KiUndefinedInstructionException. While the exception is being handled, CPSR.Mode is set to 11011b i.e. Undefined.

KiUndefinedInstructionException() directly handles certain undefined instructions like __rdpmccntr64, but for the rest, it simply dispatches the exception to KiDispatchException() which in turn calls KiPreprocessInternalInvalidOpcode(). WoA uses the following undefined instructions:

Opcode  Mnemonic        Description
0xDEFE  __debugbreak    Breaks into the debugger. Used by ntdll!DbgUserBreakPoint().
0xDEFC  __assertfail    Used to indicate critical assertion failures in the kernel debugger. Used by KeAccumulateTicks().
0xDEFB  __fastfail      Indicates fast fail conditions resulting in KeBugCheckEx(KERNEL_SECURITY_CHECK_FAILURE). Called by functions like InsertTailList() upon detecting a corrupted list, as described in [9].
0xDEFA  __rdpmccntr64   Reads the 64-bit performance counter co-processor register and returns the value in R0+R1. Used by ReadTimeStampCounter(), KiCacheFlushTrial() etc.
0xDEFD  __debugservice  Invokes a debugger breakpoint. Used by DbgBreakPointWithStatusEnd(), DebugPrompt() etc.
0xDEF9  __brkdiv0       Divide By Zero Exception, used by functions like nt!_rt_udiv. Also generated by the compiler to check the divisor before division operations.
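
When reading a disassembly or a crash dump it can be handy to map these opcodes back to their meaning. The sketch below does that in C using only the values from the table above; the enum and the function name classify_udf16 are hypothetical and do not correspond to any Windows API.

```c
#include <stdint.h>

typedef enum {
    WOA_NOT_UDF16,     /* not a 16-bit 0xDExx opcode */
    WOA_UDF_OTHER,     /* 0xDExx, but not one of the well known values */
    WOA_DEBUGBREAK,    /* 0xDEFE */
    WOA_DEBUGSERVICE,  /* 0xDEFD */
    WOA_ASSERTFAIL,    /* 0xDEFC */
    WOA_FASTFAIL,      /* 0xDEFB */
    WOA_RDPMCCNTR64,   /* 0xDEFA */
    WOA_BRKDIV0        /* 0xDEF9 */
} WoaUdfKind;

/* Classify a 16-bit Thumb opcode according to the table above. */
WoaUdfKind classify_udf16(uint16_t opcode)
{
    /* Only opcodes whose top byte is 0xDE belong to this group. */
    if ((opcode & 0xFF00u) != 0xDE00u)
        return WOA_NOT_UDF16;

    switch (opcode) {
    case 0xDEFE: return WOA_DEBUGBREAK;
    case 0xDEFD: return WOA_DEBUGSERVICE;
    case 0xDEFC: return WOA_ASSERTFAIL;
    case 0xDEFB: return WOA_FASTFAIL;
    case 0xDEFA: return WOA_RDPMCCNTR64;
    case 0xDEF9: return WOA_BRKDIV0;
    default:     return WOA_UDF_OTHER;
    }
}
```

For instance, the defe opcode that follows the call to nt!KiBugCheckDispatch in the FIQ handler later in this article classifies as WOA_DEBUGBREAK, while an ordinary "bx lr" (4770) classifies as WOA_NOT_UDF16.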

Calling Convention

The ARM CPU and the X64 CPU have very similar calling conventions in that the first four parameters to a function are passed via registers. However, unlike the X64, which has a register spill space, the ARM compiler does not reserve any space on the stack for register based parameters. Another similarity between X64 and ARM is that only the function prolog and epilog modify the value of the stack pointer (SP); the function body never changes SP. The registers used for parameter passing on the ARM CPU are R0, R1, R2 and R3, which carry the first, second, third and fourth parameter respectively. Any remaining parameters are passed on the stack.

The following figure shows the assembler code sequence during a function call.

FIG#7
Figure 7 : Function Parameters

Function Prolog and Epilog

The following code snippet is an example of instructions that typically make up the prolog of a non-leaf function:

nt!ExAllocatePoolWithTag:
835a7000 e92d4ff0 push        {r4-r11,lr}
835a7004 f20d0b1c addw        r11,sp,#0x1C
835a7008 b08f     sub         sp,sp,#0x3C

The push instruction above saves the non-volatile registers R4, R5, R6, R7, R8, R9, R10, R11 and LR (R14) on the stack. LR (R14) holds the return address that is used to return execution control back to the caller.
The addw instruction sets up the r11 register to point to the location on the stack where the old r11 register was saved. This creates a frame pointer chain similar to the one created on the X86 with the EBP register.
And finally the sub instruction creates space on the stack for local variables.
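
The constant 0x1C in the addw instruction follows from the register list. In the 32-bit push encoding the low 16 bits of the opcode form a bit mask of the registers being saved (0x4ff0 in e92d4ff0, i.e. R4-R11 and LR), and lower numbered registers are stored at lower addresses. The following C sketch, with hypothetical helper names, computes where a given register lands relative to SP immediately after the push.

```c
#include <assert.h>
#include <stdint.h>

/* Number of registers named in a push/pop register list (bit n = Rn). */
unsigned reglist_count(uint16_t reglist)
{
    unsigned count = 0;
    for (unsigned r = 0; r < 16; r++)
        if (reglist & (1u << r))
            count++;
    return count;
}

/* Byte offset from SP (right after the push) of the slot holding `reg`. */
unsigned saved_reg_offset(uint16_t reglist, unsigned reg)
{
    assert(reg < 16 && (reglist & (1u << reg)) != 0);

    /* Every saved register with a lower number sits below `reg`. */
    unsigned below = 0;
    for (unsigned r = 0; r < reg; r++)
        if (reglist & (1u << r))
            below++;
    return below * 4u;
}
```

With the register list 0x4ff0 from the prolog above, saved_reg_offset(0x4ff0, 11) is 0x1C, which is exactly the immediate used by "addw r11,sp,#0x1C", and the saved LR sits one slot higher at 0x20.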

The corresponding function epilog is shown below:

nt!ExAllocatePoolWithTag+0x98:
835a7098 b00f     add         sp,sp,#0x3C
835a709a e8bd8ff0 pop         {r4-r11,pc}

The add instruction in the above snippet simply adjusts the stack pointer to skip over the local variables.
The pop instruction restores back the contents of the non-volatile registers which were saved in the function prolog.
The value of the saved LR register (i.e. the return address) is restored back into the PC, thus returning control back to the caller and obviating the need for an explicit branch instruction.

FIG#8
Figure 8 : Function Prolog and Epilog

The prolog and epilog for leaf functions (i.e. functions that do not call other functions) are very different from the sequence shown above. Following is the complete disassembly of a leaf function:

nt!IopGetDeviceAttachmentBase:
83456478 f8d030b0 ldr         r3,[r0,#0xB0]
8345647c e002     b           nt!IopGetDeviceAttachmentBase+0xc (83456484)
8345647e 4618     mov         r0,r3
83456480 f8d330b0 ldr         r3,[r3,#0xB0]
83456484 699b     ldr         r3,[r3,#0x18]
83456486 2b00     cmp         r3,#0
83456488 d1f9     bne         nt!IopGetDeviceAttachmentBase+0x6 (8345647e)
8345648a 4770     bx          lr

In the code snippet shown above, the LR register contains the return address of the caller upon entry. Since this function does not modify the LR register contents, returning to the caller simply involves branching to LR i.e. "bx lr".
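
The two branches in the listing above use 16-bit Thumb encodings whose targets are relative to the PC, which reads as the address of the branch instruction plus 4. As a minimal sketch (the function names are hypothetical, and only the 16-bit "b" and conditional branch encodings are handled), the targets displayed by the debugger can be recomputed from the opcodes as follows:

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of value (bits <= 12 in this sketch). */
int32_t sign_extend(uint32_t value, unsigned bits)
{
    int32_t sign = (int32_t)(1u << (bits - 1));
    int32_t v = (int32_t)(value & ((1u << bits) - 1u));
    return (v ^ sign) - sign;
}

/* 16-bit unconditional branch: 11100 imm11, offset = SignExtend(imm11) * 2. */
uint32_t thumb_b_target(uint32_t address, uint16_t opcode)
{
    int32_t offset = sign_extend(opcode & 0x7FFu, 11) * 2;
    return address + 4u + (uint32_t)offset;
}

/* 16-bit conditional branch: 1101 cond imm8, offset = SignExtend(imm8) * 2. */
uint32_t thumb_bcond_target(uint32_t address, uint16_t opcode)
{
    int32_t offset = sign_extend(opcode & 0xFFu, 8) * 2;
    return address + 4u + (uint32_t)offset;
}
```

Applied to the listing, thumb_b_target(0x8345647c, 0xe002) gives 0x83456484 and thumb_bcond_target(0x83456488, 0xd1f9) gives 0x8345647e, the same targets WinDBG shows in parentheses.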

Function Disassembly Walkthrough

To tie together all the concepts introduced above, this section provides a complete annotated listing of the user mode function CreateFileA() in kernelbase.dll.

Here is the prototype of CreateFileA() along with the registers and stack locations that would contain the parameters passed in by the caller.

HANDLE WINAPI 
CreateFileA(
  LPCSTR lpFileName,                            ; P1 = r0
  DWORD dwDesiredAccess,                        ; P2 = r1
  DWORD dwShareMode,                            ; P3 = r2
  LPSECURITY_ATTRIBUTES lpSecurityAttributes,   ; P4 = r3
  DWORD dwCreationDisposition,                  ; P5 = stack sp[0] 
  DWORD dwFlagsAndAttributes,                   ; P6 = stack sp[4]
  HANDLE hTemplateFile );                       ; P7 = stack sp[8]

The term "callee" used in the following code snippet refers to the function CreateFileW() which is called by CreateFileA(). Figure 9 depicts the state of the stack after the sub instruction has executed i.e. prolog for CreateFileA() has completed.

0:000> uf kernelbase!CreateFileA
KERNELBASE!CreateFileA:
757a1628 e92d4870 push        {r4-r6,r11,lr} ; save only those non-volatile 
                                             ; registers that will be overwritten
757a162c f20d0b0c addw        r11,sp,#0xC    ; point r11 to the location on the stack 
                                             ; where callers r11 (frame pointer) is stored
757a1630 b087     sub         sp,sp,#0x1C    ; create space for local variables (0x10) bytes 
                                             ; and for parameters to callees (0xc) bytes
757a1632 460e     mov         r6,r1          ; r6 = r1 = dwDesiredAccess(caller P2)
757a1634 4601     mov         r1,r0          ; r1 = r0 = lpFileName(caller P1)
757a1636 a804     add         r0,sp,#0x10    ; r0 = sp + 0x10
757a1638 461c     mov         r4,r3          ; r4 = r3 = lpSecurityAttributes(caller P4)
757a163a 4615     mov         r5,r2          ; r5 = r2 = dwShareMode(caller P3)
757a163c f000fc1a bl          KERNELBASE!Basep8BitStringToDynamicUnicodeString (757a1e74)
757a1640 b1a0     cbz         r0,KERNELBASE!CreateFileA+0x44 (757a166c) ; if the return value 
                                             ; from the previous function call (r0) is 0 then 
                                             ; goto  757a166c (exit)
KERNELBASE!CreateFileA+0x1a:
757a1642 9b0e     ldr         r3,[sp,#0x38]  ; r3 = *(sp+0x38) = hTemplateFile(caller P7)
757a1644 9805     ldr         r0,[sp,#0x14]  ; r0 = *(sp+0x14) Local = lpFileName(callee P1)
757a1646 462a     mov         r2,r5          ; r2 = r5 = dwShareMode(callee P3)
757a1648 9302     str         r3,[sp,#8]     ; *(sp+0x8) = r3 = hTemplateFile(callee P7)
757a164a 9b0d     ldr         r3,[sp,#0x34]  ; r3 = *(sp+0x34) = dwFlagsAndAttributes(caller P6)
757a164c 4631     mov         r1,r6          ; r1 = r6 = dwDesiredAccess(callee P2)
757a164e 9301     str         r3,[sp,#4]     ; *(sp+0x4) = r3 = dwFlagsAndAttributes(callee P6)
757a1650 9b0c     ldr         r3,[sp,#0x30]  ; r3 = *(sp+0x30) = dwCreationDisposition(caller P5)
757a1652 9300     str         r3,[sp]        ; *(sp+0x0) = r3 =  dwCreationDisposition(callee P5)
757a1654 4623     mov         r3,r4          ; r3 = r4 = lpSecurityAttributes(callee P4)
757a1656 f000fe61 bl          KERNELBASE!CreateFileW (757a231c) ; invoke callee i.e. CreateFileW()
757a165a 4b06     ldr         r3,=KERNELBASE!_imp_RtlFreeUnicodeString (758119c0)
757a165c 4604     mov         r4,r0          ; r4 = r0 = return value from CreateFileW()
757a165e a804     add         r0,sp,#0x10    ; r0 = sp + 0x10 = UnicodeString = P1 to RtlFreeUnicodeString()
757a1660 681b     ldr         r3,[r3]        ; r3 = ntdll!RtlFreeUnicodeString
757a1662 4798     blx         r3             ; call RtlFreeUnicodeString()

KERNELBASE!CreateFileA+0x3c:
757a1664 4620     mov         r0,r4          ; CreateFileA() return value in r0
757a1666 b007     add         sp,sp,#0x1C    ; free locals and parameter space on stack
757a1668 e8bd8870 pop         {r4-r6,r11,pc} ; restore all saved non-volatile 
                                             ; registers and return to caller
KERNELBASE!CreateFileA+0x44:
757a166c f06f0400 mvn         r4,#0          ; r4 = ~0x0 = 0xffffffff = -1 (INVALID_HANDLE_VALUE)
757a1670 e7f8     b           KERNELBASE!CreateFileA+0x3c (757a1664)

FIG#9
Figure 9 : Stack Layout for kernelbase!CreateFileA()
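
The offsets 0x30, 0x34 and 0x38 used in the listing to reach the caller's stack based parameters follow directly from the prolog: five registers are pushed (0x14 bytes) and 0x1C bytes are reserved by the sub instruction. The following C sketch captures that arithmetic. The function name is hypothetical, and the sketch assumes that every stack parameter occupies a single 4-byte slot, as is the case for the DWORD and HANDLE parameters of CreateFileA().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Offset from the callee's SP (after its prolog has completed) of the
 * caller's n-th parameter, for n >= 5. Parameters 1 to 4 are in r0-r3.
 */
uint32_t stack_param_offset(unsigned saved_regs, uint32_t frame_bytes,
                            unsigned param)
{
    assert(param >= 5);
    return saved_regs * 4u        /* registers pushed by the prolog    */
         + frame_bytes            /* space created by sub sp,sp,#...   */
         + (param - 5u) * 4u;     /* position among the stack params   */
}
```

For CreateFileA(), stack_param_offset(5, 0x1C, 7) is 0x38, which is the offset used by "ldr r3,[sp,#0x38]" to fetch hTemplateFile.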

Disassembly Listing

One of the things that will quickly become apparent when examining ARM disassembly in WinDBG is that, more often than not, the debugger's "uf" command will display the following warning.

0: kd> uf nt!IoCallDriver
Flow analysis was incomplete, some code may be missing

This section explains why this happens. The ARM compiler generates branches to absolute addresses using instruction sequences similar to the following:

0: kd> uf nt!IoCallDriver
.
.
.
nt!SMKM_STORE_MGR<SM_TRAITS>::SmpPageEvict+0x3b0:
835237f0 f6411c87 mov r12,#0x1987
835237f4 f2c83c55 movt r12,#0x8355
835237f8 4760 bx r12

WinDBG, as of version 6.3.9600, does not pay attention to mov instructions because they do not fall under the category of flow control instructions. WinDBG encounters the bx r12 instruction and gives up on the static disassembly because it assumes that the value of r12 will be determined at runtime. It misses the fact that the above sequence loads the constant 0x83551987 into r12 and therefore amounts to bx 0x83551987, which is nothing but a branch to another function, as shown in the figure below:

FIG#6
Figure 6 : Indirect Branch

So any time WinDBG encounters an indirect branch via a register it fails to follow the function in its entirety.
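
A tool that wanted to follow such branches statically would have to combine the two immediates itself. The following C sketch shows the arithmetic for the sequence above; the function names are hypothetical. It extracts the 16-bit immediate from the 32-bit Thumb-2 mov/movt encodings (written as displayed by the debugger, first halfword in the upper 16 bits), joins the halves, and strips bit 0, which bx interprets as the "target is Thumb code" indicator rather than as part of the address.

```c
#include <stdint.h>

/* imm16 of a 32-bit Thumb-2 mov (immediate) or movt: imm4:i:imm3:imm8. */
uint16_t thumb_mov_imm16(uint32_t opcode)
{
    uint32_t imm4 = (opcode >> 16) & 0xFu;
    uint32_t i    = (opcode >> 26) & 0x1u;
    uint32_t imm3 = (opcode >> 12) & 0x7u;
    uint32_t imm8 = opcode & 0xFFu;
    return (uint16_t)((imm4 << 12) | (i << 11) | (imm3 << 8) | imm8);
}

/* Value left in the register after mov (low half) followed by movt (high half). */
uint32_t combine_mov_movt(uint16_t low, uint16_t high)
{
    return ((uint32_t)high << 16) | low;
}

/* bx uses bit 0 to select Thumb state; the code address has bit 0 clear. */
uint32_t bx_target_address(uint32_t register_value)
{
    return register_value & ~1u;
}
```

Feeding in the opcodes f6411c87 and f2c83c55 yields the halves 0x1987 and 0x8355, the register value 0x83551987, and the branch destination 0x83551986.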

Co-Processor

The ARM CPU has multiple co-processors that implement functionality that is not a part of core instruction execution. The co-processors that are used by Windows, as well as other operating systems, include the VFP floating point co-processor (CP10) and the system control co-processor (CP15).

The MRC and MCR instructions are used to access the co-processor registers. The VFP (CP10) can also be accessed using the VMSR and VMRS instructions.

The compiler intrinsics _MoveFromCoprocessor() and _MoveToCoprocessor() and their variants can be used to access ARM co-processors from C/C++. The Visual Studio 2013 CRT source file "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\crt\src\ARM\helpexcept.c" has examples of how to use these intrinsics.

Since the CP15 co-processor contains the most critical registers required by Windows, some details of this co-processor are included in this section. The CP15 registers are organized by function groups, with each group represented by a single primary co-processor register referred to as CRn. The function groups and the corresponding primary control registers are listed in the table below:

CRn  Functionality
c0   ID and Feature Registers
c1   System Control Register
c2   Translation Table Base
c3   Domain Access Control
c5   Fault Status
c6   Fault Address Register
c7   Cache/Write Buffer Control
c8   TLB Maintenance Operations
c9   Performance Counters
c10  Memory Mapping Registers & TLB Operations
c11  DMA Control
c12  Security Extensions registers
c13  Process, Context & Thread ID Registers

The following table contains some examples of CP15 registers that are used by Windows for various low level operations. Individual CP15 registers are selected by the primary co-processor register (CRn), the secondary co-processor register (CRm), OpCode #1 (Opc1) and OpCode #2 (Opc2).

CP#  Opc1  CRn  CRm  Opc2  Description
p15  0     c1   c0   0     SCTLR System Control Register (Used by KiInitializeExceptionVectorTable to set up exception handling)
p15  0     c2   c0   0     TTBR0 Translation Table Base Register 0 (KxSwapProcess writes the Page Table Base Address to this during context switch, similar to X86 CR3)
p15  0     c5   c0   0     DFSR Data Fault Status Register (KiDataAbortException uses this to find the type of data fault that occurred)
p15  0     c5   c0   1     IFSR Instruction Fault Status Register (KiPrefetchAbortException uses this to find the type of instruction fetch fault that occurred)
p15  0     c6   c0   0     DFAR Data Fault Address Register (KiDataAbortException uses this to find the address at which the fault occurred, similar to X86 CR2)
p15  0     c6   c0   2     IFAR Instruction Fault Address Register (KiPrefetchAbortException uses this to find the address at which the fault occurred, similar to X86 CR2)
p15  0     c9   c13  0     PMCCNTR Cycle Count Register (Used by the compiler intrinsic __rdpmccntr64)
p15  0     c12  c0   0     VBAR Vector Base Address Register (base of exception table, contains nt!KiArmExceptionVectors)
p15  0     c13  c0   1     CONTEXTIDR Context ID Register (contains Address Space IDentifier i.e. KPROCESS->Asid)
p15  0     c13  c0   2     TPIDRURW Thread ID User Read Write (TEB Thread Environment Block)
p15  0     c13  c0   3     TPIDRURO Thread ID User Read Only, Privileged Read Write (31:6 KTHREAD, 3:0 IRQL)
p15  0     c13  c0   4     TPIDRPRW Privileged Read Write (KPCR Kernel Processor Control Region)

A full list of co-processor options is available in [3].

The following code snippet shows the MRC and MCR instructions accessing the contents of the TPIDRURO register in CP15 using primary register c13, secondary register c0, OpCode1=0 and OpCode2=3. The MRC instruction reads the contents of TPIDRURO into ARM register r0. The MCR instruction writes the contents of ARM register r0 to TPIDRURO. Figure 10 labels the various operands passed to the MRC instruction.

ee1d0f70 mrc         p15,#0,r0,c13,c0,#3 ; r0 = TPIDRURO
ee0d0f70 mcr         p15,#0,r0,c13,c0,#3 ; TPIDRURO = r0

FIG#10
Figure 10 : Co-processor Register Access
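
The operands map onto fixed bit fields of the 32-bit opcode, which makes it possible to recognize co-processor accesses directly from the opcode column of a disassembly. The following C sketch, with a hypothetical function name, builds the opcode in the form displayed by the debugger (first halfword in the upper 16 bits). It covers only the unconditional MRC/MCR encodings seen in this article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Encode "mrc/mcr pN,#opc1,rT,cN,cM,#opc2".
 * read = true  -> MRC (co-processor register to ARM register)
 * read = false -> MCR (ARM register to co-processor register)
 */
uint32_t encode_coproc_move(bool read, unsigned coproc, unsigned opc1,
                            unsigned rt, unsigned crn, unsigned crm,
                            unsigned opc2)
{
    assert(coproc < 16 && opc1 < 8 && rt < 16);
    assert(crn < 16 && crm < 16 && opc2 < 8);

    return 0xEE000010u                      /* fixed bits of MRC/MCR     */
         | ((uint32_t)opc1 << 21)           /* OpCode #1                 */
         | ((read ? 1u : 0u) << 20)         /* direction: 1 = MRC        */
         | ((uint32_t)crn << 16)            /* primary register CRn      */
         | ((uint32_t)rt << 12)             /* ARM register              */
         | ((uint32_t)coproc << 8)          /* co-processor number       */
         | ((uint32_t)opc2 << 5)            /* OpCode #2                 */
         | (uint32_t)crm;                   /* secondary register CRm    */
}
```

Encoding the read of TPIDRURO into r0 (coproc 15, opc1 0, CRn 13, CRm 0, opc2 3) produces 0xEE1D0F70, and the corresponding write produces 0xEE0D0F70, matching the opcodes in the snippet above.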

Here are some Windows kernel mode functions that access CP15 registers.

0: kd> uf nt!PsGetCurrentProcess
nt!PsGetCurrentProcess:
83442a18 ee1d3f70 mrc p15,#0,r3,c13,c0,#3 ; R3 = TPIDRURO
83442a1c f033033f bics r3,r3,#0x3F        ; r3 = r3 & ~0x3f
83442a20 6f58     ldr r0,[r3,#0x74]       ; r0 = *(r3 + 0x74) ; r0 = KTHREAD.ApcState.Process
83442a22 4770     bx lr                   ; return

0: kd> uf hal!KeGetCurrentIrql
hal!KeGetCurrentIrql:
8392c380 ee1d3f70 mrc p15,#0,r3,c13,c0,#3 ; R3 = TPIDRURO
8392c384 f013000f ands r0,r3,#0xF         ; r0 = r3 & 0x0f ; R0 = Irql
8392c388 4770     bx lr                   ; return
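
Both functions rely on the same packing of TPIDRURO: the current KTHREAD pointer with the current IRQL stored in its low bits. The following C sketch mirrors the two masks used by the disassembly above. The function names are hypothetical, and the register value is passed in as a plain integer so that the sketch can be compiled on any machine; on a WoA target the value would come from the mrc instruction shown above.

```c
#include <stdint.h>

/* bics r3,r3,#0x3F : strip the low 6 bits to recover the KTHREAD address. */
uint32_t tpidruro_to_kthread(uint32_t tpidruro)
{
    return tpidruro & ~0x3Fu;
}

/* ands r0,r3,#0xF : the low 4 bits hold the current IRQL. */
unsigned tpidruro_to_irql(uint32_t tpidruro)
{
    return tpidruro & 0xFu;
}
```

For a hypothetical register value of 0x87654325, the KTHREAD address would be 0x87654300 and the IRQL would be 5.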

Following is a user mode function that accesses CP15. Some of the low level macros like NtCurrentTeb() which access CP15 are defined in winnt.h.

0:000> uf kernel32!GetCurrentThreadId
kernel32!GetCurrentThreadId:
77361fd0 ee1d3f50 mrc p15,#0,r3,c13,c0,#2 ; R3 = TPIDRURW
77361fd4 6a58     ldr r0,[r3,#0x24]       ; TEB.ClientId.UniqueThread
77361fd6 4770     bx lr                   ; return

System Calls

The SVC instruction causes a Supervisor Call exception. This provides a mechanism for unprivileged software (user mode applications) to make calls into the operating system (kernel routines). WoA uses this mechanism to implement native system calls similar to the int 0x2e, sysenter and syscall instructions on the X86 and X64 CPUs. In the code snippet shown below, the NTDLL native API NtClose uses the SVC #1 instruction to invoke the exception handler for system call exceptions (nt!KiSWIException). The service index for NtClose() is 0x0d. The usage of register r12 to pass the service index into the system call is recommended by the ARM Application Binary Interface (ABI).

0:000> uf ntdll!NtClose
ntdll!NtClose:
77b8e230 f04f0c0d mov r12,#0xD ; r12 = system call identifier
77b8e234 df01     svc #1       ; call into kernel
77b8e236 4770     bx lr        ; return to caller

WoA uses a system service dispatch table similar to the one on X64. The kernel variable nt!KiServiceTable points to a table of 32-bit entries, each containing a 28-bit relative service offset and a 4-bit argument count. The kernel initialization function nt!KeCompactServiceTable() sets up the table. The logic ServiceAddress = KiServiceTable + ( KiServiceTable[ServiceIndex] >> 4 ) computes the address of the function that implements the native service. The "return from exception" instruction (i.e. RFE sp) transfers execution back to user mode.

The following example shows the address of the function nt!NtClose being computed relative to the base of the table at nt!KiServiceTable using the service index 0x0d.

0: kd> u nt!KiServiceTable + ( poi(nt!KiServiceTable + (d * 4)) >> 4 )
nt!NtClose+0x1:
8364a924 e92d4ff0 push        {r4-r11,lr}
8364a928 f20d0b1c addw        r11,sp,#0x1C
8364a92c f695ffea bl          nt!_security_push_cookie (834e0904)
8364a930 b08a     sub         sp,sp,#0x28
8364a932 4605     mov         r5,r0
8364a934 ee1d3f70 mrc         p15,#0,r3,c13,c0,#3
8364a938 f033033f bics        r3,r3,#0x3F
8364a93c f993815a ldrsb       r8,[r3,#0x15A]
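
The debugger expression used above can be written out in C as follows. This is a sketch of the arithmetic only, with hypothetical function names. It assumes that the 28-bit offset occupies the upper bits of each entry (as implied by the shift by 4), that the 4-bit argument count occupies the low bits, and that offsets are treated as unsigned. The table contents in the usage example are made up for illustration.

```c
#include <stdint.h>

/* ServiceAddress = KiServiceTable + (KiServiceTable[ServiceIndex] >> 4) */
uint32_t service_address(uint32_t table_base, const uint32_t *table,
                         uint32_t service_index)
{
    return table_base + (table[service_index] >> 4);
}

/* Assumed layout: the low 4 bits of an entry hold the argument count. */
unsigned service_arg_count(const uint32_t *table, uint32_t service_index)
{
    return table[service_index] & 0xFu;
}
```

For example, if a table were located at 0x83400000 and entry 0x0d contained 0x00012341, service_address() would return 0x83401234 and service_arg_count() would return 1.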

Exception Handling

On the X86/X64 CPU the Interrupt Descriptor Table (IDT) contains pointers to exception handlers, software interrupt handlers and hardware interrupt handlers. The ARM CPU instead has a separate exception vector table that contains instruction opcodes rather than function pointers. The opcode for each type of exception in the table is the same (0xf8dff01c). It encodes a PC relative load into the PC, which fetches the address of the handler for that exception from a slot located 0x20 bytes past the vector entry and transfers execution control to it. As a part of system startup, the kernel function nt!KiInitializeExceptionVectorTable() writes the address of the Windows exception vector table (nt!KiArmExceptionVectors) to the Vector Base Address Register (VBAR) in CP15. The ARM exception table along with the registered exception handlers is shown below.

0: kd> u nt!KiArmExceptionVectors
nt!KiArmExceptionVectors:
834dc6a0 f8dff01c ldr pc,=0xFFFFFFFF ; [nt!KiArmExceptionVectors+0x20(834dc6c0)]
834dc6a4 f8dff01c ldr pc,=nt!KiUndefinedInstructionException+0x1 (834dade1) ; [nt!KiArmExceptionVectors+0x24 (834dc6c4)]
834dc6a8 f8dff01c ldr pc,=nt!KiSWIException+0x1 (834db941) ; [nt!KiArmExceptionVectors+0x28 (834dc6c8)]
834dc6ac f8dff01c ldr pc,=nt!KiPrefetchAbortException+0x1 (834db001) ; [nt!KiArmExceptionVectors+0x2c (834dc6cc)]
834dc6b0 f8dff01c ldr pc,=nt!KiDataAbortException+0x1 (834db161) ; [nt!KiArmExceptionVectors+0x30 (834dc6d0)]
834dc6b4 f8dff01c ldr pc,=0xFFFFFFFF ; [nt!KiArmExceptionVectors+0x34(834dc6d4)]
834dc6b8 f8dff01c ldr pc,=nt!KiInterruptException+0x1 (834db601) ; [nt!KiArmExceptionVectors+0x38 (834dc6d8)]
834dc6bc f8dff01c ldr pc,=nt!KiFIQException+0x1 (834db721) ; [nt!KiArmExceptionVectors+0x3c (834dc6dc)]

FIG#11
Figure 11 : ARM Exception Table
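
The addresses shown in square brackets in the listing above can be derived from the opcode itself. The low 12 bits of f8dff01c hold the immediate 0x1c, and the PC used as the base reads as the address of the instruction plus 4, aligned down to a 4-byte boundary. The following C sketch, with hypothetical function names, recomputes the literal slot for a vector entry and strips the Thumb indicator (bit 0) from a handler pointer stored in that slot.

```c
#include <stdint.h>

/* Address of the literal slot read by "ldr pc,[pc,#imm12]" at vector_address. */
uint32_t vector_literal_address(uint32_t vector_address, uint32_t opcode)
{
    uint32_t imm12 = opcode & 0xFFFu;              /* 0x01c for f8dff01c */
    uint32_t pc = (vector_address + 4u) & ~3u;     /* PC as seen by the load */
    return pc + imm12;
}

/* Handler pointers have bit 0 set to indicate Thumb code. */
uint32_t handler_entry_address(uint32_t literal_value)
{
    return literal_value & ~1u;
}
```

For the Undefined Instruction vector at 834dc6a4 this gives the slot 834dc6c4, and the value stored there (834dade1) corresponds to nt!KiUndefinedInstructionException at 834dade0.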

On the X86 and X64 there is a single exception handler that handles all types of page faults. On the ARM CPU there are two different handlers: one for data page faults (nt!KiDataAbortException) and another for code page faults (nt!KiPrefetchAbortException). Both of these exception handlers call the common routine nt!KiCommonMemoryManagementAbort to perform the bulk of page fault handling.

Fast IRQ handling is not supported on the WoA platform. Examining the implementation of the FIQ exception handler (nt!KiFIQException) shows that this function if ever called would bug-check the system with the stop code 0x3d (INTERRUPT_EXCEPTION_NOT_HANDLED).

0: kd> uf KiFIQException
nt!KiFIQException:
834db720 e98dc011 srs         sp,#0x11
834db724 e9cd4502 strd        r4,r5,[sp,#8]
834db728 466c     mov         r4,sp
.
.
.
nt!KiFIQException+0x10c:
834db82c 203d     movs        r0,#0x3D
834db82e 2100     movs        r1,#0
834db830 2200     movs        r2,#0
834db832 2300     movs        r3,#0
834db834 468c     mov         r12,r1
834db836 f000fa51 bl          nt!KiBugCheckDispatch (834dbcdc)
834db83a defe     __debugbreak

Interrupt Descriptor Tables

On the X86/X64 CPU, drivers register their interrupt service routines (ISRs) through a system provided template directly in the interrupt descriptor table (IDT). ARM platforms that have a generic interrupt controller (GIC) do not support vectored interrupts. So WoA routes all hardware interrupts through a single entry point (nt!KiInterruptException) which is responsible for determining the source of the interrupt from the GIC and then dispatching the interrupt to the appropriate driver's ISR.

Similar to the X64 CPU, WoA uses a total of 16 IRQLs. The IRQLs associated with hardware devices are in the range 0x8 through 0xb. For each device IRQL, the first 16 device interrupts at that IRQL are registered directly in the KPCR->Idt[] array. Any overflow interrupts, i.e. beyond the 16 interrupts per device IRQL, are registered in the KPCR->IdtExt[] array. The function KiConnectInterruptInternal() determines if there is an overflow situation and accordingly allocates the extended IDT at KPCR->IdtExt from NonPagedPool with 0x400 entries. Both the primary IDT (KPCR->Idt[]) and the extended IDT (KPCR->IdtExt[]) contain pointers to KINTERRUPT structures that were allocated when drivers registered their ISRs.

The following debugger commands show one such KINTERRUPT structure.

0: kd> !pcr
KPCR for Processor 0 at 835df000:
    Major 1 Minor 1
Panic Stack 00000000
Dpc Stack 00000000
Irql addresses:
    Mask    835df000
    Table   835df000
    Routine 835df000

0: kd> dt nt!_KPCR 835df000 -a Idt
   +0x12c Idt : 
.
.
.
    [128] 0x8fb38a80 Void
    [129] 0x8fb38880 Void
    [130] 0x8e723980 Void
    [131] 0x8e723600 Void
    [132] 0x8e723900 Void
    [133] 0x8e723200 Void
    [134] 0x8e723e00 Void
    [135] 0x8e723e80 Void
.
.
.
    [144] 0x8fb38b00 Void
    [145] 0x8fb38900 Void
    [146] 0x8e723a80 Void
    [147] 0x8e723680 Void
    [148] 0x8e723a00 Void
    [149] 0x8e723500 Void
    [150] 0x8e723300 Void
    [151] 0x8e723f00 Void
.
.
.

0: kd> dt nt!_KINTERRUPT 0x8fb38a80
   +0x000 Type             : 0n22
   +0x002 Size             : 0n112
   +0x004 InterruptListEntry : _LIST_ENTRY [ 0x8fb38a84 - 0x8fb38a84 ]
   +0x00c ServiceRoutine   : 0x90026211     unsigned char  dxgkrnl!DpiFdoLineInterruptRoutine+0
   +0x010 MessageServiceRoutine : (null) 
   +0x014 MessageIndex     : 0
   +0x018 ServiceContext   : 0x87b1d768 Void
   +0x01c SpinLock         : 0
   +0x020 TickCount        : 0
   +0x024 ActualLock       : 0x877cb360  -> 0
   +0x028 DispatchAddress  : (null) 
   +0x02c Vector           : 0x800
   +0x030 Irql             : 0x8 ''
   +0x031 SynchronizeIrql  : 0xb ''
   +0x032 FloatingSave     : 0 ''
   +0x033 Connected        : 0x1 ''
   +0x034 Number           : 0
   +0x038 ShareVector      : 0x1 ''
   +0x03a ActiveCount      : 0
   +0x03c InternalState    : 0n0
   +0x040 Mode             : 0 ( LevelSensitive )
   +0x044 Polarity         : 0 ( InterruptPolarityUnknown )
   +0x048 ServiceCount     : 0
   +0x04c DispatchCount    : 0
   +0x050 PassiveEvent     : (null) 
   +0x054 TrapFrame        : 0x82b30d30 _KTRAP_FRAME
   +0x058 DispatchCode     : [4] 0
   +0x068 DisconnectData   : (null) 
   +0x06c ServiceThread    : (null) 

Unlike the X86/X64 where the IDT is a hardware defined structure, on the ARM CPU the IDT is software defined. This has an interesting security benefit in that the KINTERRUPT structure on ARM no longer needs to contain any executable code, as can be observed from the size of the KINTERRUPT.DispatchCode[] array in the above output, and hence it can be allocated out of Non-Executable NonPagedPool.

In addition to the primary and extended IDTs described above, WoA also uses a global secondary IDT for General Purpose I/O (GPIO) interrupts. This IDT is allocated from non-paged pool and is pointed to by the global variable nt!KiGlobalSecondaryIDT. Each entry in this table is of type KSECONDARY_IDT_ENTRY which contains an embedded KINTERRUPT structure as shown below. The current implementation allocates the secondary IDT with 0x100 entries.

0: kd> db nt!KiSecondaryInterruptServicesEnabled L1
835c0ad7  01    

0: kd> ? poi (nt!KiGlobalSecondaryIDT)
Evaluate expression: -2037473280 = 868ea000

0: kd> dt 868ea000  nt!_KSECONDARY_IDT_ENTRY
   +0x000 SpinLock         : 0
   +0x004 ConnectLock      : _KEVENT
   +0x014 LineMasked       : 0 ''
   +0x018 InterruptList    : 0x8fb38e80 _KINTERRUPT

0: kd> dt 0x8fb38e80 nt!_KINTERRUPT
   +0x000 Type             : 0n22
   +0x002 Size             : 0n112
   +0x004 InterruptListEntry : _LIST_ENTRY [ 0x8fb38e84 - 0x8fb38e84 ]
   +0x00c ServiceRoutine   : 0x8c7ab761     unsigned char  portcls!CInterruptSync::GetKInterrupt+0
   +0x010 MessageServiceRoutine : (null) 
   +0x014 MessageIndex     : 0
   +0x018 ServiceContext   : 0x87d92d68 Void
   +0x01c SpinLock         : 0
   +0x020 TickCount        : 0
   +0x024 ActualLock       : 0x87d92d98  -> 0
   +0x028 DispatchAddress  : (null) 
   +0x02c Vector           : 0x1000
   +0x030 Irql             : 0xb ''
   +0x031 SynchronizeIrql  : 0xb ''
   +0x032 FloatingSave     : 0 ''
   +0x033 Connected        : 0x1 ''
   +0x034 Number           : 0
   +0x038 ShareVector      : 0x1 ''
   +0x03a ActiveCount      : 0
   +0x03c InternalState    : 0n0
   +0x040 Mode             : 1 ( Latched )
   +0x044 Polarity         : 0 ( InterruptPolarityUnknown )
   +0x048 ServiceCount     : 0
   +0x04c DispatchCount    : 0
   +0x050 PassiveEvent     : (null) 
   +0x054 TrapFrame        : (null) 
   +0x058 DispatchCode     : [4] 0
   +0x068 DisconnectData   : (null) 
   +0x06c ServiceThread    : (null)

As of WinDBG v6.3.9600, the debugger's "!idt" and "!idt -a" commands display all three IDTs mentioned above, but only expand the entries in the secondary IDT, as shown below:

0: kd> !idt -a

Dumping IDT: 835df12c

Dumping Extended IDT: 00000000

Dumping Secondary IDT: 868ea000 

1000:portcls!CInterruptSync::GetKInterrupt+0x20 (KINTERRUPT 8fb38e80)

1002:mbtu97w8arm+0x358c (KMDF) (KINTERRUPT 8fb38f00)

1003:hidi2c!OnInterruptIsr (KMDF) (KINTERRUPT 8e723180)

1004:SurfaceHomeButton+0x28bc (KMDF) (KINTERRUPT 8e723100)

1005:SurfaceHomeButton+0x28bc (KMDF) (KINTERRUPT 8e723080)

1006:SurfaceHomeButton+0x28bc (KMDF) (KINTERRUPT 8e723000)

1007:SurfaceHomeButton+0x28bc (KMDF) (KINTERRUPT 8fb38f80)

1008:hidi2c!OnInterruptIsr (KMDF) (KINTERRUPT 8e723280)

100a:nvthml+0x2ebc (KMDF) (KINTERRUPT 8fb38d00)

100b:sdbus!SdbusGpioInterrupt (KINTERRUPT 8fb38b80)

Conclusion

This article described the ARM CPU, registers and Thumb-2 instructions. It explained the functionality of the instructions typically seen in code generated by the Visual Studio compiler as well as details of the function calling convention. It covered some unique aspects of the ARM CPU like the barrel shifter, the co-processors, and explicit opcodes used for memory barriers and undefined instructions, while also explaining how such aspects are used by Windows. This article also highlighted some of the key differences between how certain features like trap frames, exception handling, interrupt dispatching, interrupt descriptor tables, system calls and interlocked operations are implemented on ARM as compared to X86/X64.

Special thanks to Alex Ionescu (@aionescu) for his review and valuable feedback on this article.

References

[1] ARM Intrinsics : MSDN
[2] ARM Assembler Reference : MSDN
[3] ARM Cortex-A9 Technical Reference Manual
[4] ARM Architecture Reference Manual - ARMv7-A and ARMv7-R edition (Login Required)
[5] Introduction to ARM
[6] Whirlwind Tour of ARM Assembly
[7] NT Debugging Blog : Understanding ARM Assembly Part 1
[8] Windows feature lets you generate a memory dump file by using the keyboard
[9] New Security Assertions in Windows 8