Another way of classifying viruses is by how they try to conceal themselves, both from users and from anti-virus software.
Not hiding at all is one concealment strategy which is remarkably easy to implement in a computer virus. It goes without saying, however, that it's not very effective - once the presence of a virus is known, it's trivial to detect and analyze.
With an encrypted virus, the idea is that the virus body (infection, trigger, and payload) is encrypted in some way to make it harder to detect. This "encryption" is not what cryptographers call encryption; virus encryption is better thought of as obfuscation. (Where it's necessary to distinguish between the two meanings of the word, I'll use the term "strong encryption" to mean encryption in the cryptographic sense.)
When the virus body is in encrypted form, it's not runnable until decrypted. What executes first in the virus, then, is a decryptor loop, which decrypts the virus body and transfers control to it. The general principle is that the decryptor loop is small compared to the virus body, and provides a smaller profile for antivirus software to detect.
A decryptor loop can decrypt the virus body in place, or to another location; this choice may be dictated by external constraints, like the writability of the infected program's code. This example shows an in-place decryption. How is virus encryption done? Here are six ways:
- Simple encryption. No key is used for simple encryption, just basic parameterless operations, like incrementing and decrementing, bitwise rotation, arithmetic negation, and logical NOT.
- Static encryption key. A static, constant key is used for encryption which doesn't change from one infection to the next. The operations used would include arithmetic operations like addition, and logical operations like XOR.
- Variable encryption key. The key begins as a constant value, but changes as the decryption proceeds.
- Substitution cipher. A more general encryption could employ lookup tables which map byte value between their encrypted and decrypted forms.
- Strong encryption. There is no reason why viruses cannot use strong encryption. Previously, code size might have been a factor, if the virus would have to carry strong decryption code with it, but this is no longer a problem: most systems now contain strong encryption libraries which can be used by Viruses.
The major weakness in the encryption schemes above is that the encrypted virus body is the same from one infection to the next. That constancy makes a virus as easy to detect as one using no concealment at all!
With random encryption keys, this error is avoided: the key used for encryption changes randomly with each new infection. This idea can be applied to any of the encryption types described here. Obviously, the virus' decryptor loop must be updated for each infection to incorporate the new key.
A stealth virus is a virus that actively takes steps to conceal the infection itself, not just the virus body. Furthermore, a stealth virus tries to hide from everything, not just anti-virus software. Some examples of stealth techniques are below:
- An infected file's original timestamp can be restored after infection, so that the file doesn't look freshly-changed.
- The virus can store (or be capable of regenerating) all pre-infection information about a file, including its timestamp, file size, and the file's contents. Then, system I/O calls can be intercepted, and the virus would play back the original information in response to any I/O operations on the infected file, making it appear uninfected.
This technique is applicable to boot block I/O too. The exact method of intercepting I/O calls depends on the operating system. Under MS-DOS, for instance, I/O requests are made with interrupt calls, whose handlers are located via user-accessible interrupt vectors.
The virus need only modify the interrupt vector to insert itself into the chain of interrupt handlers. On other systems, I/O is performed using shared libraries, so a virus can impose itself into key shared library routines to intercept I/O calls for most applications.
- Some systems store the secondary boot loader as consecutive disk blocks, to make the primary boot loader's task simpler. On these systems, there are two views of the secondary boot loader, as a sequence of blocks, and as a file in the filesystem.
A virus can insert itself into the secondary boot loader's blocks, relocating the original blocks elsewhere in the filesystem. The end result is that the usual, filesystem view shows no obvious changes, but the virus is hidden and gets run courtesy of the real primary boot loader.
A variation is a reverse stealth virus, which makes everything look infected - the damage is done by anti-virus software frantically (and erroneously) trying to disinfect.
Stealth techniques overlap with techniques used by rootkits, Rootkits were originally toolkits for people who had broken into computers; they used these toolkits to hide their tracks and avoid detection. Malware now uses rootkits too: for example, the Ryknos Trojan horse tried to hide itself using a rootkit intended for digital-rights management.
Assuming an encrypted virus' key is randomly changed with each new infection, the only unchanging part of the virus is the code in the decryptor loop. Anti-virus software will exploit this fact for detection, so the next logical development is to change the decryptor loop's code with each infection.
An oligomorphic virus, or semi-polymorphic virus, is an encrypted virus which has a small, finite number of different decryptor loops at its disposal. The virus selects a new decryptor loop from this pool for each new infection. For example, Whale had 30 different decryptor variants, and Memorial had 96 decryptors.
In terms of detection, oligomorphism only makes a virus marginally harder to spot. Instead of looking for one decryptor loop for the virus, anti-virus software can simply have all of the virus' possible decryptor loops enumerated, and look for them all.
A polymorphic virus is superficially the same as an oligomorphic virus. Both are encrypted viruses, both change their decryptor loop on each infection. However, a polymorphic virus has, for all practical purposes, an infinite number of decryptor loop variations. Tremor, for example, has almost six billion possible decryptor loops!
Polymorphic viruses clearly can't be detected by listing all the possible combinations. There are two questions that arise with respect to polymorphic viruses. First, how can a virus detect that it has previously infected a file, if its presence is hidden sufficiently well? Second, how does the virus change its decryptor loop from infection to infection?
At first glance, it might seem easy for a polymorphic virus to detect if it has previously infected some code - when the virus morphs for a new infection, it can also change whatever aspect of itself that it looks for.
This doesn't work, though, because a virus must be able to recognize infection by any of its practically-infinite forms. This means that the infection detection mechanism must be independent of the exact code used by the virus:
- File timestamp. A virus could change the timestamp of an infected file, so that the sum of its time and date is some constant value K for all infections. A lot of software only displays the last two digits of the year, so an infected file's year could be increased by 100 without attracting attention.
- File size. An infected file could have its size padded out to some meaningful size, such as a multiple of 1234.
- Data hiding. In complex executable file formats, like ELF, not all parts of the file's information may be used by a system. A virus can hide a flag in unused areas, or look for an unusual combination of attributes that it has set in the file. For example, Zperm looks for the character "Z" as the minor linker version in an executable's file header on Windows.
- Filesystem features. Some filesystems allow files to be tagged with arbitrary attributes, whose existence is not always made obvious. These can be used by a virus to store code, data, or flags which indicate that a file has been infected.
- External storage. The indication that a file is infected need not be directly associated with the file itself. For example, a virus could use a hash function to map an infected file's name into an obfuscated string, and use that string to create a key in the Windows Registry.
The virus could then use the existence of that key as an infection indicator. Even if the Registry key was discovered, it wouldn't immediately reveal the name of the infected file (especially if a strong cryptographic hash function was used).
Note that none of these mechanisms need to work perfectly, because a false positive only means that the virus won't infect some code that it might have otherwise. Also, since all these infection-detection methods work for polymorphic viruses, they also work for the more specific case of non-polymorphic viruses too.
Viruses which retain some constancy can just look for one or two bytes of their own code, rather than resorting to more elaborate methods. It was once suggested that systems could be inoculated against specific viruses by faking the virus' self-detection indicator on an uninfected system. Unfortunately, there are too many viruses now to make this feasible.
Changing the Decryptor Loop
The code in a polymorphic virus is transformed for each fresh infection using a mutation engine. The mutation engine has a grab-bag of code transformation tricks which take as input one sequence of code and output another, equivalent, sequence of code.
Choosing which technique to apply and where to apply it can be selected by the engine using a pseudo-random number generator. The result is an engine which is extensible and which can permute code in a large number of ways. Some sample transformations are shown below.
- Instruction equivalence. Especially on CISC architectures like the Intel x86, there are often many single instructions which have the same effect.
- Instruction sequence equivalence. Instruction equivalence can be generalized to sequences of instructions. While single-instruction equivalence is at the mercy of the CPU's instruction set, instruction sequence equivalence is more portable, and applies to both high-level and low-level languages.
- Instruction reordering. Instructions may have their order changed, so long as constraints imposed by inter-instruction dependencies are observed.
- Register renaming. A minor, but significant, change can be introduced simply by changing the registers that instructions use. While this makes no difference from a high-level perspective, such as a human reading the code, renaming changes the bit patterns that encode the instructions; this complicates matters for anti-virus software looking for the virus' instructions.
- Reordering data. Changing the locations of data in memory will have a similar effect in terms of altering instruction encoding as register renaming. This would not necessarily have a corresponding transformation in a high-level language, as the variable names themselves would not be changed, just their order.
- Making spaghetti. Although some programmers are naturally gifted when it comes to producing "spaghetti code," others are not as fortunate. Happily, code can be automatically transformed so that formerly-consecutive instructions are scattered, and linked together by unconditional jumps.
- Inserting junk code. "Junk" computations can be inserted which are inert with respect to the original code - in other words, running the junk code doesn't affect what the original code does.
- Run-time code generation. One way to transform the code is to not have all of it present until it runs. Either fresh code can be generated at run time, or existing code can be modified.
- Interpretive dance. The way code is executed can be changed, from being directly executed to being interpreted by some application-specific virtual machine. A "classical" interpreter for such virtual machine code mimics the operation of a real CPU as it fetches, decodes, and executes instructions.
- Concurrency. The original code can be separated into multiple threads of execution, which not only transforms the code, but can greatly complicate automatic analysis.
- Inlining and outlining. Code inlining is a technique normally employed to avoid subroutine call overhead, that replaces a subroutine call with the subroutine's code.
- Subroutine interleaving. Inlining and outlining transformations maintain the original code, but rebundle it in different ways. Code can also be transformed by combining independent subroutines together.
A number of these transformations are also used in the (legitimate) field of code obfuscation; code obfuscation research is used to try and prevent reverse engineering. There are also many, many elaborate code transformations performed by optimizing compilers.
Not all compiler techniques and code obfuscation techniques have yet been used by virus writers. Instead of supplying transformations for the mutation engine to pick from, a virus writer may create a mutation engine that will automatically produce a distinct, equivalent decryptor loop.
In compilers, automatically searching for a code sequence is referred to as superoptimization, and the search may be implemented in a variety of ways: brute-force, automated theorem proving, or any technique for searching a large search space. Zellome, for example, uses a genetic algorithm in its mutation engine.
Enormous computational demands are required by such a search, although a clever algorithm may avoid generating too much illegal code and thus improve search time. For now, this mutation method is a curiosity only.
Metamorphic viruses are viruses that are polymorphic in the virus body.^^^ They aren't encrypted, and thus need no decryptor loop, but avoid detection by changing: a new version of the virus body is produced for each new infection. The code-modifying techniques used by polymorphic viruses all apply to metamorphic viruses.
Both employ a mutation engine, except a polymorphic virus need not change its engine on each infection, because it can reside in the encrypted part of the virus. In contrast, a metamorphic virus' mutation engine has to morph itself anew for each infection.
Some metamorphic viruses are very elaborate. Simile's mutation engine, about 12,000 lines of assembly code, translates Simile from machine code to a machine-independent intermediate code.
Operating on the intermediate code, the mutation engine undoes old obfuscations, applies new transformations, and generates fresh machine code. Metamorphic mutation engines whose input and output are machine code must be able to disassemble and reassemble machine code.
Metamorphism is relatively straightforward to implement in viruses that spread in source code form, such as macro viruses. A virus may rely on system tools for metamorphism, too.
Apparition, for instance, is written in Pascal and carries its own source code; if a compiler is found on an infected system, the virus inserts junk code into its source and recompiles itself.
While polymorphic and metamorphic viruses are decidedly nontrivial to detect by anti-virus software, they are also hard for a virus writer to implement correctly - the numbers of these viruses are small in comparison to other types.
The encryption methods discussed so far result in viruses that, once captured, are susceptible to analysis. The major problem is not the encryption method, because that can always be strengthened; the major problem is that viruses carry their decryption keys with them. This might seem a necessary weakness, because if a virus doesn't have its key, it can't decrypt and run its code.
There are, however, two other possibilities.
1. The key comes from outside an infected system:
- A virus can retrieve the key from a web site, but that would mean that the virus would then have to carry the web site's address with it, which could be blocked as a countermeasure. To avoid knowing a specific web site's name, a virus could use a web search engine to get the key instead.
Generally, any electronic data stream that a virus can monitor would be usable for key delivery, especially ones with high volumes of traffic that are unlikely to be blocked: email messages, Usenet postings, instant messaging, IRC, file-sharing networks.
- A binary virus is one where the virus is in two parts, and doesn't become virulent until both pieces are present on a system. There have only been a few binary viruses, such as Dichotomy and RMNS. One manifestation of binary viruses would be where virus V1 has stronglyencrypted code, and virus V2 has its key.
But this scheme is unlikely to work well in practice. If V1 and V2 travel together, then both will bear the same risk of capture and analysis, defeating the purpose of separating the encryption key.
If V1 and V2 spread separately (e.g., V2 is released a month after V1, and uses a different infection vector) then their spread would be independent. Now, say that P1 is the probability of Vi reaching a given machine, and P2 is that probability for V2. With an independent spread, the probability of them both finding the same machine and becoming virulent isP1xP2, i.e., smaller.
2. The key comes from inside an infected system.
Using environmental key generation, the decryption key is constructed of elements already present in the target's environment, like:
- The machine's domain name.
- The time or date.
- Some data in the system (e.g., file contents)
- The current user name.
- The interface's language setting (e.g., Chinese, Hebrew).
This makes it very easy to target viruses to particular individuals or groups. A target doesn't even know that they possess the key! Combined with strong encryption, environmental key generation would render a virus unanalyzable even if captured.
To fully analyze an encrypted virus, it has to be decrypted, and while the elements comprising the key may be discovered, the exact value of the key will not. In this case, the only real hope of decryption lies in a poor choice of key.
A poorly-chosen key with a relatively small range of possible values (e.g., the language setting) would be susceptible to a brute-force attack. How can the virus know that its decryption was successful?
It doesn't. While the virus could carry a checksum with it to verify that the decryption worked, that might give away information to an analyst. An alternative method is to catch exceptions that invalid code may cause, then try to run the decrypted "code" and see if it works.