Anti-Virus Techniques
Anti-virus software does up to three major tasks:
- Detection - Determining whether or not some code is a virus which, in the purest form of detection, results in a Boolean value: yes, this code is infected, or no, this code is not infected.
Ultimately, detection is a losing game. Precisely detecting viruses by their appearance or behavior is provably undecidable - a virus writer can always construct a virus which is undetectable by some anti-virus software.
Then the anti-virus software can be updated to detect the new virus, at which point the virus writer can build another new virus, and so on. Should a virus always be detected, even if it can't run?
Yes. Even if a virus is dormant on one system, it is still useful to detect it so that the virus doesn't affect another system. Anti-virus software is regularly applied to incoming email, for instance, where the email recipient's machine is different from the machine running the mail server and anti-virus software.
The other case is where a virus won't run on any system. Finding an intended virus may point to some underlying security flaw, and thus it can be useful to detect those viruses too.
- Identification - Once a virus is detected, which virus is it? The identification process may be distinct from detection, or identification may occur as a side effect of the detection method being used.
- Disinfection - Disinfection is the process of removing detected viruses; this is sometimes called cleaning. Normally a virus would need to be precisely identified in order to perform disinfection.
Detection and disinfection can be performed using generic methods that try to work with known and unknown viruses, or using virus-specific methods which only work with known viruses. (Virus-specific methods may catch unknown variants of known viruses, however.)
Detection is arguably the most important of the three tasks above, because identification and disinfection both require detection as a prerequisite. In addition, early detection (i.e., before an infection has occurred) completely alleviates the need for the other tasks. There are five possible outcomes for detection.
Four of them come from pairing whether a virus is really present with whether one is reported. Perfect virus detection would always yield the two correct outcomes, which lie on the diagonal of that two-by-two grid: a virus is detected if one is really present, and no virus is detected if none is there.
Detection isn't perfect, though. A false positive is when the anti-virus software reports a virus even though a virus isn't really there, which can waste time and resources on wild goose chases. A false negative, or a miss, is when anti-virus software doesn't detect a virus that's present.
Either type of false reading serves to undermine user confidence in the anti-virus software. The fifth outcome is ghost positives, where a virus is detected that is no longer there, because a previous attempt at disinfection was incomplete and left enough virus remnants to still be detected.
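The four basic outcomes can be tallied in a few lines. This is a minimal sketch: the function names and the (virus_present, virus_reported) pair layout are assumptions made for illustration, and ghost positives are not modeled separately, since telling them apart would require knowing the disinfection history.

```python
from collections import Counter

def classify_outcome(virus_present: bool, virus_reported: bool) -> str:
    """Name the detection outcome for one scanned object."""
    if virus_present and virus_reported:
        return "true positive"       # correct detection
    if not virus_present and not virus_reported:
        return "true negative"       # correctly reported clean
    if virus_reported:
        return "false positive"      # alarm, but no virus is present
    return "false negative"          # a miss

def error_rates(results):
    """results: iterable of (virus_present, virus_reported) pairs."""
    counts = Counter(classify_outcome(p, r) for p, r in results)
    infected = counts["true positive"] + counts["false negative"]
    clean = counts["true negative"] + counts["false positive"]
    return {
        "false_positive_rate": counts["false positive"] / clean if clean else 0.0,
        "false_negative_rate": counts["false negative"] / infected if infected else 0.0,
    }
```

Perfect detection corresponds to both rates being zero.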
Detection methods can be classified as static or dynamic, depending on whether or not the virus' code is running when the detection occurs.
Detection - Static Methods
Static anti-virus techniques attempt virus detection without actually running any code. There are three static techniques: scanners, heuristics, and integrity checkers.
Scanners
The term "scanner" in the context of anti-virus software is another term which has been diluted through common usage, like "virus" itself. It is often applied generically to refer to anti-virus software, regardless of what technique the anti-virus software is using.
Scanners can be classified based on when they are invoked:
- On-demand - On-demand scanners run when explicitly started by the user. Many anti-virus techniques draw upon a database of information about current threats, and forcing an on-demand scan is useful when a new virus database is installed. An on-demand scan may also be desirable when an infection is suspected, or when a questionable file is downloaded.
- On-access - An on-access scanner runs continuously, scanning every file when it's accessed. As might be expected, the extra I/O overhead and resources consumed by the scanner impose a performance penalty.
Some on-access scanners permit tuning, so that scans are only performed for read accesses or write accesses; normally scanning would be done for both. A machine where all files arrive via the network may only want scanning on write accesses, for example, because that would provide complete anti-virus coverage while minimizing the performance hit.
Each virus is represented by one or more patterns, or signatures: sequences of bytes which (hopefully) uniquely characterize the virus. Signatures are sometimes called scan strings, and need not be constant strings.
Some anti-virus software may support "don't care" symbols called wildcards that match an arbitrary byte, a part of a byte, or zero or more bytes. The process of searching for viruses by looking through a file for signatures is called scanning, and the code that does the search is called a scanner.
More generally, the search is done through a stream of bytes, which would include the contents of a boot block, a whole file, part of a file being written or read, or network packets.
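Signature matching with a "don't care" symbol can be sketched as follows. The textual signature format ("??" for a one-byte wildcard) and the function names are assumptions for illustration; only the whole-byte wildcard is handled, not partial-byte or variable-length ones.

```python
from typing import List, Optional

def parse_signature(text: str) -> List[Optional[int]]:
    """Parse e.g. 'DE AD ?? EF' into [0xDE, 0xAD, None, 0xEF].

    None stands for a wildcard that matches any single byte.
    """
    return [None if tok == "??" else int(tok, 16) for tok in text.split()]

def find_signature(data: bytes, sig: List[Optional[int]]) -> int:
    """Return the first offset in data where sig matches, or -1."""
    for start in range(len(data) - len(sig) + 1):
        if all(s is None or data[start + i] == s for i, s in enumerate(sig)):
            return start
    return -1
```

The data argument can be any byte stream held in memory: a boot block, a file's contents, or a reassembled network payload.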
With hundreds of thousands of signatures to look for, searching for them one at a time is infeasible. The biggest technical challenge in scanning is finding algorithms which are able to look for multiple patterns efficiently, and which scale well.
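One well-known way to search for many constant patterns in a single pass is the Aho-Corasick automaton. The text does not say which algorithm any particular scanner uses, so the sketch below is only an illustration of the multi-pattern idea; it assumes non-empty, wildcard-free signatures given as bytes.

```python
from collections import deque

def build_automaton(patterns):
    """Build Aho-Corasick tables: goto transitions, failure links, outputs."""
    goto = [{}]    # goto[state][byte] -> next state
    fail = [0]     # failure link of each state
    out = [[]]     # indices of patterns recognized on reaching each state
    for idx, pat in enumerate(patterns):
        state = 0
        for b in pat:
            if b not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][b] = len(goto) - 1
            state = goto[state][b]
        out[state].append(idx)
    queue = deque(goto[0].values())    # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for b, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and b not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(b, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]   # inherit suffix matches
    return goto, fail, out

def scan(data, patterns):
    """Return sorted (offset, pattern_index) pairs for every match in data."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for pos, b in enumerate(data):
        while state and b not in goto[state]:
            state = fail[state]
        state = goto[state].get(b, 0)
        for idx in out[state]:
            hits.append((pos - len(patterns[idx]) + 1, idx))
    return sorted(hits)
```

The scan visits each input byte once no matter how many signatures are loaded, which is the scaling property the paragraph above asks for. A real scanner would build the automaton once and reuse it across streams.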
Static Heuristics
Anti-virus software can employ static heuristics in an attempt to duplicate expert anti-virus analysis. Static heuristics can find known or unknown viruses by looking for pieces of code that are generally "virus-like," instead of scanning for specific virus signatures.
This is a static analysis technique, meaning that the code being analyzed is not running, and there is no guarantee that any suspicious code found would ever be executed.
Static heuristic analysis is done in two steps:
- Data: the Gathering. Data can be collected using any number of static heuristics. Whether or not any one heuristic correctly classifies the input is not critical, because the results of many heuristics will be combined and analyzed later.
- Analysis. As hinted at by the terms "booster" and "stopper" (heuristic results that make a virus verdict more likely or less likely, respectively), analysis of static heuristic data may be as simple as weighting each heuristic's value and summing the results. If the sum passes some threshold, then the input is deemed to be infected.
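The weight-and-sum analysis can be sketched as below. The heuristic names, weights, and threshold in the usage note are hypothetical, chosen only to show a positive weight acting as a booster and a negative weight acting as a stopper.

```python
def score_heuristics(findings, weights, threshold):
    """Sum the weights of the heuristics that fired and compare to a threshold.

    findings:  dict mapping heuristic name -> bool (did the heuristic fire?)
    weights:   dict mapping heuristic name -> float
               (positive = booster, negative = stopper; values are assumptions)
    Returns (total_score, deemed_infected).
    """
    total = sum(weights[name] for name, fired in findings.items() if fired)
    return total, total >= threshold
```

For example, with hypothetical weights {"entry_point_in_last_section": 3.0, "writes_to_executables": 2.5, "has_gui_resources": -2.0} and a threshold of 4.0, an input that triggers only the first two heuristics is flagged, while one that also triggers the stopper is not.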
Signatures of suspicious code will most likely be chosen by expert anti-virus researchers. This process can be automated, however, at least for some restricted domains: IBM researchers automatically found static heuristic signatures for boot-sector infectors (BSIs). They took two corpuses of boot blocks, one exclusively containing BSIs, one with no infections.
A computer found trigrams - sequences of three bytes - which appeared frequently in the BSI corpus but not in the other corpus. Finally, they computed a 4-cover such that each BSI had at least four of the found BSI trigrams. After this process, they were left with a set of only fifty trigrams to look for.
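The trigram selection just described can be approximated in a short sketch. Two simplifications are assumptions of this example rather than details from the text: a trigram qualifies if it appears in any infected block and in no clean block (no frequency cutoff), and the k-cover is built greedily. The function names are illustrative.

```python
def trigrams(block: bytes) -> set:
    """All distinct 3-byte sequences in a block."""
    return {block[i:i + 3] for i in range(len(block) - 2)}

def candidate_trigrams(infected, clean):
    """Trigrams seen in some infected block and in no clean block."""
    seen_clean = set().union(*(trigrams(b) for b in clean))
    seen_infected = set().union(*(trigrams(b) for b in infected))
    return seen_infected - seen_clean

def greedy_k_cover(infected, candidates, k):
    """Greedily pick trigrams until each infected block contains >= k of them.

    A block with fewer than k candidate trigrams is covered as far as possible.
    """
    per_block = [trigrams(b) & candidates for b in infected]
    need = [min(k, len(t)) for t in per_block]
    chosen = set()
    while any(n > 0 for n in need):
        remaining = candidates - chosen
        # Pick the trigram that helps the most still-uncovered blocks;
        # sorting first makes ties resolve deterministically.
        best = max(
            sorted(remaining),
            key=lambda t: sum(1 for tri, n in zip(per_block, need) if n > 0 and t in tri),
        )
        chosen.add(best)
        need = [n - 1 if n > 0 and best in tri else n
                for tri, n in zip(per_block, need)]
    return chosen
```

With k = 4 this yields a small trigram set in which every infected sample contains at least four members, in the spirit of the 4-cover above.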
The presence or absence of these trigrams was used to classify a boot block as infected or not. Static heuristics may be viewed as a way to reduce the resource requirements of anti-virus scanners.
Full virus signatures in a virus database can be distilled down to a set of short, generic, static heuristic signatures. (The distillation may even be done automatically, using the IBM technique just described.)
An anti-virus scanner can look for these short signatures, loading in their associated set of full virus signatures only if a match is found. This alleviates the need to keep full signatures in memory.
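A two-stage scan along these lines might look like the following sketch. The index layout (short signature mapped to a group name) and the loader callback are hypothetical; they stand in for whatever on-disk database format a real product uses.

```python
def two_stage_scan(data, short_index, load_full_signatures):
    """Prefilter with short signatures, then load full ones only on a hit.

    short_index:          dict mapping short signature (bytes) -> group name
    load_full_signatures: callable taking a group name and returning a dict
                          of virus name -> full signature (bytes)
    Returns the names of viruses whose full signature occurs in data.
    """
    matches = []
    loaded = set()
    for short_sig, group in short_index.items():
        if short_sig in data and group not in loaded:
            loaded.add(group)                      # load each group at most once
            for name, full_sig in load_full_signatures(group).items():
                if full_sig in data:
                    matches.append(name)
    return matches
```

Clean input never triggers the loader, so in the common case only the short signatures occupy memory.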
Integrity Checkers
With the exception of companion viruses, viruses operate by changing files. An integrity checker exploits this behavior to find viruses, by watching for unauthorized changes to files. Integrity checkers must start with a perfectly clean, 100% virus-free system; it is impossible to overstate this.
The integrity checker initially computes and stores a checksum for each file in the system it's watching. Later, a file's checksum is recomputed and compared against the original, stored checksum. If the checksums are different, then a change to the file occurred. There are three types of integrity checker:
- Offline. Checksums are only verified periodically, e.g., once a week.
- Self-checking. Executable files are modified to check themselves when run. Ironically, modifying executables to self-check their integrity involves virus-like mechanisms. Self-checking can be done in a less-obtrusive way by adding the self-checking code into shared libraries.
In general, anti-virus software will perform integrity self-checking, regardless of the anti-virus technique it uses. The allure of attacking anti-virus software is too great to ignore.
- Integrity shells. An executable file's checksum is verified immediately prior to execution. This can be incorporated into the operating system kernel for binary executable files; the ideal positioning is less clear for other types of "executable" files, like batch files, shell scripts, and scripting language programs.
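The compute-store-compare cycle shared by all three types can be sketched as an offline checker. The text only says "checksum"; using SHA-256 here is an assumption of this example, as are the function names.

```python
import hashlib
from pathlib import Path

def file_checksum(path):
    """SHA-256 digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_baseline(root):
    """Record a checksum for every regular file under root.

    This must be run on a system known to be clean.
    """
    return {str(p): file_checksum(p) for p in Path(root).rglob("*") if p.is_file()}

def verify(root, baseline):
    """Recompute checksums and report differences from the baseline."""
    current = build_baseline(root)
    changed = sorted(p for p in baseline if p in current and current[p] != baseline[p])
    missing = sorted(p for p in baseline if p not in current)
    new = sorted(p for p in current if p not in baseline)
    return {"changed": changed, "missing": missing, "new": new}
```

An integrity shell would perform the same comparison for a single file immediately before executing it, rather than on a schedule. The baseline itself needs protection, since a virus that can rewrite it can hide its changes.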
Detection - Dynamic Methods
Dynamic anti-virus techniques decide whether or not code is infected by running the code and observing its behavior.
Behavior Monitors/Blockers
A behavior blocker is anti-virus software which monitors a running program's behavior in real time, watching for suspicious activity. If such activity is seen, the behavior blocker can prevent the suspect operations from succeeding, can terminate the program, or can ask the user for the appropriate action to perform.
Behavior blockers are sometimes called behavior monitors, but the latter term implies (rightly or wrongly) that no action is taken, and the burglars are only watched while they steal the silver. What does a behavior blocker look for?
Roughly speaking, a behavior blocker watches for a program to stray from what the blocker considers to be "normal" behavior. Normal behavior can be modeled in three ways, by describing:
- The actions that are permitted. This is called positive detection.
- The actions that are not permitted, called negative detection.
- Some combination of the two, in much the same way that static heuristics included boosters and stoppers.
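The three ways of modeling normal behavior map onto a permitted set, a forbidden set, or both. In this minimal sketch the action names in the usage note are hypothetical labels; a real behavior blocker would derive them from intercepted system calls.

```python
def should_block(action, allowed=None, denied=None):
    """Decide whether a behavior blocker should stop an action.

    allowed: set of permitted actions (positive detection), or None if unused
    denied:  set of forbidden actions (negative detection), or None if unused
    Supplying both sets gives the combined model.
    """
    if denied is not None and action in denied:
        return True      # explicitly forbidden
    if allowed is not None and action not in allowed:
        return True      # not on the list of permitted actions
    return False
```

For instance, should_block("write_executable", denied={"write_executable"}) returns True, while should_block("read_file", allowed={"read_file"}) returns False. Positive detection blocks anything unanticipated, which is stricter but more prone to false positives than negative detection.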
Emulation
Behavior blocking allows code to run on the real machine. In contrast, anti-virus techniques using emulation let the code being analyzed run in an emulated environment. The hope is that, under emulation, a virus will reveal itself. Because any virus found wouldn't be running on the real computer, no harm is done.
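The idea can be illustrated with a toy emulator. The four-operation instruction set below is invented for this sketch and bears no relation to a real CPU; the point is only that file writes are logged rather than performed, and that a step budget bounds how long the emulator waits for the code to reveal itself.

```python
def emulate(program, max_steps=1000):
    """Run a toy program in a sandbox, logging file writes instead of doing them.

    program: list of tuples such as ("nop",), ("jmp", target_index),
             ("write_file", name), ("halt",)   (a made-up instruction set)
    Returns (observed_actions, status).
    """
    pc, steps, observed = 0, 0, []
    while 0 <= pc < len(program) and steps < max_steps:
        op, *args = program[pc]
        steps += 1
        if op == "halt":
            return observed, "halted"
        if op == "jmp":
            pc = args[0]
            continue
        if op == "write_file":
            observed.append(("write_file", args[0]))   # recorded, never executed
        pc += 1
    in_range = 0 <= pc < len(program)
    return observed, "budget exhausted" if in_range else "ran off end"
```

Running the same program with a smaller max_steps can end before the write is reached, so nothing suspicious is observed even though the code would eventually attempt it.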
Comparison of Anti-Virus Detection Techniques
No one technique is best for detecting every type of virus, and a combination of techniques is the most secure design.
Scanning:
- Pro: Gives precise identification of any viruses that are found. This characteristic makes scanning useful by itself, as well as in conjunction with other anti-virus techniques.
- Con: Requires an up-to-date database of virus signatures for scanning to be effective. Even assuming that users update their virus databases right away, which isn't the case, there is a delay between the time when a new threat is discovered and when an anti-virus company has a signature update ready.
This leaves open a window of opportunity in which systems can be compromised. Also, scanning only finds known viruses, and some minor variants of them.
Static heuristics:
- Pro: Static heuristic analysis detects both known and unknown viruses.
- Con: False positives are a major problem, and a detected virus is neither identified, nor disinfectible except by using generic methods.
Integrity checkers:
- Pro: Integrity checkers boast high operating speeds and low resource requirements. They detect known and unknown viruses.
- Con: Detection only occurs after a virus has infected the computer, and the source of the infection can't necessarily be pinpointed. An integrity checker can't detect viruses in newly-created files, or ones modified legitimately, such as through a software update. Ultimately, the user will be called upon to assess whether a change to a file was made legitimately or not. Finally, found viruses can't be identified or disinfected.
Behavior blockers:
- Pro: Known and unknown viruses are detected.
- Con: While a behavior blocker knows which executable is the problem, unlike an integrity checker, it again cannot identify or disinfect the virus. Run-time overhead and false positives are a concern, as is the fact that the virus is already running on the system prior to being detected.
Emulation:
- Pro: Any viruses found are running in a safe environment. Known and unknown viruses are detected, even new polymorphic viruses.
- Con: Emulation is slow. The emulator may stop before the virus reveals itself, and in any case precise emulation is very hard to get correct. The usual concerns about identification and disinfection apply to emulation, too.
In general, dynamic methods impose a run-time overhead for monitoring that is not incurred by static methods. The tradeoff is that dynamic methods, by watching code run, effectively peel away a layer of obfuscation from viral code.