Low Level Programming Languages & C Programming
Note
These notes are mostly there, but may miss a couple of bits from the slides...
Higher level languages use the compiler or interpreter to force programmers to program safely. C is a very unsafe programming language, allowing the programmer to do just about anything they want with the hardware, with very few restrictions.
C was the first mainstream language built as a layer on top of assembly (which is architecture specific), and it was designed to be very portable. The idea that a computer would be used maliciously was not even considered when the language was conceived.
The entire software stack is traditionally built in C, and rewriting parts of it in newer languages is very costly.
Hijacking
Buffer Overflow
Here is a simple C program:
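#include <stdio.h>
#include <string.h>
int main(int argc, char** argv) {
    char buffer[5];           /* room for 4 characters plus the terminating '\0' */
    strcpy(buffer, argv[1]);  /* no bounds check: this is the vulnerability */
    return 0;
}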
This program has a buffer overflow vulnerability. We declare a fixed-size buffer of 5 bytes, then copy argv[1] into it. If argv[1] is more than 5 bytes in length (including its terminating null byte), then the strcpy function will gladly continue to write memory out of bounds for us.
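In practice this appears to work fine for up to 5 characters of input (even though 5 characters plus the terminating null byte already overflow the buffer by one byte), but if we have more than 5 characters, suddenly we encounter issues:
$ ./a.out 111111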
*** stack smashing detected ***: terminated
[1] 26613 IOT instruction (core dumped) ./a.out 111111
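Detecting this kind of overflow is now normally managed by software (as with the stack smashing check above), but Arm is now working on hardware implementing Capability Hardware Enhanced RISC Instructions (CHERI), which includes hardware support for preventing out-of-bounds reads and writes.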
Memory Layout
The physical view of the memory layout is managed by the operating system, and includes paging, caching, swapping, etc. In this module, we simply look at the logical view of the memory layout:
- Kernel -- High Addresses
- Stack
- Empty Space (Hopefully!)
- Heap
- Data
- Text -- Low Address
The heap is memory that we allocate with functions such as malloc, and is where dynamically allocated objects live, whilst the stack is used for storing local variables. Each stack frame stores the data for the current function call, the return address, the saved pointer to the previous frame, and other bookkeeping for the stack.
In a language like Java, heap objects are destroyed as part of the garbage collection routines. In C, we explicitly free space on the heap ourselves (with free).
Heap memory is allocated dynamically at run time, whereas the size and layout of each stack frame is decided at compile time.
The stack grows downwards (towards lower addresses). The heap doesn't really grow in the same way, instead taking a "shovel of space" at a time, and we don't really need to worry about the heap other than the address ranges that have been allocated.
Stack Frames
When we invoke a new function, we push a new "stack frame" onto the stack, including the parameters of the function, then the return address, and any registers that are "caller-owned" and thus need to be saved.
The compiler counts all the variables declared, then decides how much space is needed for all the variables. The stack pointer (%rsp on x86-64) is then moved by the right number of bytes to make space for the function arguments and variables; as the stack grows downwards, this means decrementing it.
We also have the notion of the base pointer (%rbp). This is caller owned, so it needs to be pushed onto the stack at the start of the function, then popped at the end. The base pointer holds the address at which the function's memory starts, and using any memory above the base pointer is probably bad, as any memory above belongs to the caller.
The stack pointer will move around as we add variables to the stack, etc., but the base pointer will always point to the base of the allocated space.
Beyond Program Crashes
A program crash shows us that there is an issue with the code. If we are evil, then we can craft an input that lets us execute arbitrary code instead of just crashing the program. If we can execute our own code, then the possibilities for attacks are much greater.
Invoking a Program
By convention, a program is invoked by first calling main(), with some possible arguments.
We can declare main either as int main(void) or as int main(int argc, char** argv). Every non-void function returns a value, which the caller can ignore. When returning from main in C, we use 0 as a good exit code (i.e., no issues), and another code otherwise.
The original philosophy of C was that it is simple, compared to the languages that came before it. For example, we don't actually need to write a return statement. This reduces complexity for the programmer, but makes the language more dangerous.
#include is called a directive. It is not a part of the program itself: it is not an instruction for the program but an instruction for the compiler (more precisely, its preprocessor), telling it what to do. stdio.h is a header file. When the compiler reads this file, it knows that the library functions exist. The implementation of each function is not in this file, merely the declaration.
The linker finds all unresolved names in the file, e.g., printf, then searches the libraries to resolve the references to these unknown symbols.
The standard doesn't enforce a standard length for an integer. From the perspective of writing a program, the difference in the size of the integer might cause problems if we have an integer that overflows on some hardware and doesn't on others.
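#include <stdio.h>
int main(void) {
    unsigned char a = 255;
    signed char b = 128;   /* out of range for signed char: typically wraps to -128 */
    unsigned int c = 4445;
    signed int d = 5543;
    /* %c prints the raw character; %u is the matching conversion for unsigned int */
    printf("a %d, b %d, a %c, b %c, c %u, d %d, size %d\n", a, b, a, b, c, d, (int)sizeof(c));
    return 0;
}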
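This gives a 255, b -128, a <ff>, b <80>, c 4445, d 5543, size 4, as on the computer used for testing, int is 4 bytes long. The value 128 does not fit in a signed char, so b ends up as -128. The %c conversions print a and b as raw characters; the bytes 0xff and 0x80 are outside the ASCII range, so the terminal shows them as <ff> and <80>.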
The limits of the integer datatypes are given in limits.h, and the limits of the floating-point types in float.h. We can use sizeof to check the size of a type on the current platform.
Type Casting
We can cast a variable, or the return value of a function, to a specified datatype by writing the type in parentheses in front of it, e.g., (int) sizeof(int). For a character char c = 'b', we can get its ASCII code with (int) c.
For chars, we can use character constants such as 'c', '\t' and '\u02C0', or a plain integer code such as 99. Integer constants can be written as 85 for decimal, 076 for octal, and 0xfa for hexadecimal.
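We can mark an integer constant as unsigned with 30u, as long with 30l, and as unsigned long with 30ul. Floating-point constants with an f suffix, such as 3.1f, .58f and 123e4f, are floats, whilst unsuffixed constants such as 3.14, .58 and 123e4 are doubles (C has no d suffix).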