Format String Vulnerabilities

Perhaps one of the most interesting errors that we discovered was a result of an unusual interaction of two parts of csh, along with a little careless programming. The following string will cause the VAX version of csh to crash: !o%8f and the following string: !o%888888888f.

Format functions in C take as input a "format string" followed by zero or more arguments. The format string tells the function how to compose the arguments into an output string; depending on the format function, the output string may be written to a file, the standard output location, or a buffer in memory.

Format string problems, the cause of the errors in the above quote, were a curiosity in 1990 when those words were published. By 1999, format string problems were recognized as a security problem, and they were being actively exploited by 2000.

The canonical example of a format function is printf:

char *s = "is page"; int n = 125; printf("Hello, world!"); printf("This %s %d.", s, n);

The first call to print f prints Hello, world!; its format string doesn't contain any special directives telling printf to look for any additional arguments. The second call, on the other hand, does - %S says to interpret the next unread argument (s) as a pointer to a string, and %d treats the next unread argument (n) as an integer.

The result is the output: This is page 125.

Saying "the next unread argument" implies that printf consumes the arguments as it formats the output string, and this is exacdy what happens. Figure 1 shows the stack layout for a call to printf, assuming again that arguments are passed on the stack.

As a format function reads its arguments, it effectively steps a pointer through the stack, where the pointer identifies the next argument to be read. Format functions exhibit a touching faith in the correctness of the format string.

A format function has no way of knowing how many arguments were really passed by its caller, which can be disastrous if an attacker is able to supply any part of a format string.

For example, if the program contains: printf (error) ;

and an attacker manages to set the variable error to "yosyosyosyos", then the program will almost certainly crash. Printf's attacker-specified format string tells it to grab the next four items off the stack and treat each one as a pointer to a string.

The problem is that the next four items on the stack aren 7 pointers to strings, so printf will make wild memory references in an effort to format its alleged strings.

As is, this attack can be used to print out the contents of a target program's stack: an attacker can craft a format string which walks up the stack, interpreting each stack item as a number and printing the result.

Changing error to "yodyodyodyod" in the above example would be enough to print the stack contents. This is one possible way that addresses can be discovered for a later stack smashing attack. Even more is possible if the attacker can control a format string located in the stack.

The code below is a common scenario, where a buffer's contents are formatted for later output.

void print _ e r r o r (char*s) { char buffer[123]; snprintf(buffer, sizeof(buffer), ''Error: %s", s) ; printf(buffer); }

The snprintf function is a format function with two additional arguments, a buffer and the buffer's length; snprintf writes its formatted output to this buffer. It also demonstrates that a format string vulnerability can be exploited indirectly, as the flaw here is in the call to printf, not snprintf.

With this code, the attacker's format string can be the ungainly construction

"\x78\x56\x34\x12 %d%d%d%d%d%d%d %n"|

The buffer, a local variable on the stack, contains p r i n t f s format string after the call to snprintf. Printf is thus called with a format string that the attacker has supplied in part:

"Error: \x78\x56\x34\x12 %d%d%d%d%d%d%d %n"|

There are four parts to this format string.

  1. Error: is added by snprintf. It plays no part in this attack and can be ignored.
  2. \x78\x56\x34\x12 is the address 12345678 in little-endian format; in C strings, \x introduces a pair of hexadecimal digits.
  3. %d%d%d%d%d%d%d , used as mentioned above to walk up the stack's contents.
  4. %n is a format string directive. It tells printf to interpret the next unread argument as a pointer to an integer. Printf writes the number of bytes it's formatted so far into the pointed-to integer. Through this mechanism, the attacker has a way to have a value written to an arbitrary memory location.

The stack layout during an attack is given in Figure 2.

The attacker's format string causes printf to walk up the stack, printing integers, until the next unread argument is the address the attacker encoded in the format string. (Remember that the buffer is in the stack, so the attacker's format string is there too.)

The %n takes the attacker's address and writes a number at that address. The attacker can control the number written by adding junk characters to the format string, changing the number of bytes printf formats, and consequently the number written for %n.

Like other attacks, if an attacker can make a single specified value change, then the possibility of running shellcode exists.

Defenses

The underlying moral in studying these technical vulnerabilities is to never, ever, ever trust input to a program. Having bulletproof input routines and bug free code is the best defense to technical vulnerabilities, but expecting this of all software is like asking Santa Claus for world peace - well intentioned, but unlikely to happen in the near future.

In the meantime, two types of defenses can be considered, ones that are specific to a type of vulnerability, and ones that are more general.

Vulnerability-Specific Defenses

Defenses can be directed to guarding against certain types of vulnerability. For example:

Format string vulnerabilities:

  • Source code auditing is a particularly effective defense, because the number of format functions is relatively small, and it is easy to search source code for calls to format functions.
  • Remove support for yoii in format functions, or only allow constant format strings that an attacker can't change. This defense would break existing code in addition to violating the C specification.
  • If a format function knew how many arguments it had been called with, then it could avoid reading nonexistent arguments. Unfortunately, this information isn't available at run-time. A program's source code can be altered to supply this information.

Calls to known format functions can be wrapped in macros that keep track of the number of arguments passed. Even this doesn't always work, because nonstandard format functions may be used, or standard format functions may be used in unusual ways. For example, the code may save a function pointer to printf and call it later, rather than calling printf directly.

Stack smashing:

  • As mentioned before, one defense against stack smashing is to mark the stack's memory as nonexecutable; the same idea can be extended to the data and heap segments. This is not a complete defense, since a return-to-library attack is still possible, but it does close one attack vector.

Some programs legitimately need to have executable code in odd places in memory, like just-in-time compilers and nested C functions. An alternative memory protection approach ensures that memory pages can be writable or executable, but not both at the same time. This provides the same protection, but with more flexibility for legitimate programs.

  • The control information in the stack, the return address and the saved frame pointer, can be guarded against inappropriate modification. This method prevents stack smashing attacks, and also catches some buggy programs. The way the control information is guarded is by using canaries.

Miners used to use live canaries as a safety precaution. A buildup of toxic gases in a mine would kill a canary before a human, so canaries were taken down into mines as an early-warning system. Finding a metabolically-challenged canary meant that it was time for a coffee break on the surface.

For stack smashing defense, a canary is a value which is strategically located in the stack frame, between the local variables and the control information. A canary can't withstand an attack- in theory - and if the canary is corrupted, then an attack may have occurred, so the program should issue an alert and exit immediately.

General Defenses

Since most of the technical vulnerabilities stem from the use of programming languages with weaknesses, like the lack of bounds checking, one general approach is to stop using those languages.

No more C, no more C++. This suggestion ignores many realities: legacy code, programmer training, programmer and management biases towards certain programming languages, the cost and availability of tools and compilers, constraints from third-party libraries.

In any case, even if use of "weak" programming languages was stopped, history has shown that existing applications in those languages would linger in active use for decades. A related idea is not to change programming languages, but to repair problems with an existing language after the fact.

For example, bounds checking could be added to C programs. Current approaches to bounds checking C code are dogged by problems: incomplete protection, breaking existing code. This is also an area where adding 'less than 26%' overhead is deemed to make a tool practical for use.

A more feasible defense is to randomize the locations of as many addresses as possible. If the locations of the stack, shared libraries, program code, and heap-allocated memory change each time the program is run, then an attacker's task is made more difficult.

However, it also makes legitimate debugging more difficult, in terms of finding spurious bugs, if these locations change nondeterministically. There is also evidence that the amount of randomization that can be provided is insufficient to prevent attacks completely.

A brute-force attack on a well-chosen target is possible, albeit much slower than attacking a system without any randomization. A program's code can also be monitored as it runs, akin to behavior blocking anti-virus techniques.

The monitoring system looks for potential attacks by watching for specific abnormal behaviors, like a function return jumping into a buffer, or a return instruction not returning to its call site.

The tricky part is pausing the monitored program's execution at critical points so that checks may be performed, without introducing excessive overhead, without modifying the monitored program's code. A solution comes in the form of caching:

  • The monitoring system maintains a cache of code chunks that have already been checked against the monitor's security policy.
  • Cached code chunks run directly on the CPU, rather than using slow emulation, and a chunk returns control back to the monitor when it's done running.
  • Each control transfer is checked - if the destination corresponds to an already-cached code chunk, then execution goes to the cached chunk. Otherwise, the destination code chunk is checked for security violations and copied into the code cache.

Code chunks in the cache can be optimized, mitigating some of the monitoring overhead.