Virus Databases and Description Languages

Up to now, the existence of a virus database for anti-virus software has been assumed but not discussed. Conceptually, a virus database is a database containing records, one for every known virus.

When a virus is detected using a known-virus detection method, one side effect is to produce a virus identifier. This virus identifier may not be the virus' name, or even be human-readable, but can be used to index into the virus database and find the record corresponding to the found virus.

A virus record will contain all the information that the anti-virus software requires to handle the virus. This may include:

  • A printable name for the virus, to display for the user.
  • Verification data for the virus. Again, a copy of the entire virus would not be present; the last section discussed other ways to perform verification.
  • Disinfection instructions for the virus.
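The lookup described above can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names, the identifiers, and the `Example.A`/`Example.B` entries are all hypothetical, and a real product would use its own compact binary format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VirusRecord:
    """One database record; every field name here is hypothetical."""
    name: str            # printable name to display for the user
    verification: bytes  # verification data, not a copy of the virus
    disinfection: str    # disinfection instructions, in some engine-defined form


# The detector produces an opaque identifier, which is used as the index.
VIRUS_DB = {
    0x0001: VirusRecord("Example.A", b"\x12\x34", "restore-header"),
    0x0002: VirusRecord("Example.B", b"\xab\xcd", "delete-file"),
}


def lookup(virus_id: int) -> Optional[VirusRecord]:
    """Return the record for a detected virus, or None if the id is unknown."""
    return VIRUS_DB.get(virus_id)
```

The identifier is deliberately not the printable name: the name is just one more field in the record.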

Any virus signatures stored in the database must be carefully handled. Why?

The figure below illustrates a potential problem with virus databases when more than one anti-virus program is present on a system. If virus signatures are stored in an unencrypted form, then one anti-virus program may declare another vendor's virus database to be infected, because it can find a wealth of virus signatures in the database file!

The safest strategy is to encrypt stored virus signatures, and never to decrypt them. Instead, the input data being checked for a signature can be similarly encrypted, and the signature check can compare the encrypted forms.
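A minimal sketch of comparing encoded forms follows. It assumes a simple byte-substitution table as a stand-in for whatever transformation a vendor actually uses; this is not strong encryption, and the table seed, the signature bytes, and the function names are all invented for illustration. A byte-wise substitution is chosen because it preserves substring matches, so the search can run entirely on encoded data.

```python
import random
from typing import List


def make_table(seed: int) -> bytes:
    """Build a substitution table: a permutation of the byte values 0..255."""
    values = list(range(256))
    random.Random(seed).shuffle(values)
    return bytes(values)


TABLE = make_table(seed=1234)  # hypothetical fixed key shipped with the scanner


def encode(data: bytes) -> bytes:
    """Apply the substitution to every byte of the data."""
    return data.translate(TABLE)


# In practice the vendor would encode signatures at build time, so only the
# encoded form would ever appear in the database file on the user's disk.
STORED_SIGNATURES = {"Example.A": encode(b"\xde\xad\xbe\xef")}


def scan(data: bytes) -> List[str]:
    """Encode the input the same way, then search for the encoded signatures."""
    encoded = encode(data)
    return [name for name, sig in STORED_SIGNATURES.items() if sig in encoded]
```

Because the stored signatures are never decoded, another vendor's scanner looking for the plain signature bytes would not find them in this database.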

As new viruses are discovered, an anti-virus vendor will update their virus database, and all their users will require an updated copy of the virus database in order to be properly protected against the latest threats. This raises a number of questions:

  • How is a user informed of updates?

The typical model is that users periodically poll the anti-virus vendor for updates. The polling is done automatically by the anti-virus software, although a user can manually force an update to occur.

Another model is referred to as a push model, where the anti-virus vendor "pushes out" updates to users as soon as they are available. Many vendors use the polling model, but will email alerts about new threats to users upon request, permitting them to make an informed choice about updating.
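The polling model can be sketched as a single check that the client runs on a timer, or on demand when the user forces an update. The function and parameter names below are hypothetical, and the network operations are passed in as callables so that the sketch stays self-contained.

```python
from typing import Callable


def poll_once(local_version: int,
              fetch_latest_version: Callable[[], int],
              download_update: Callable[[int], None]) -> int:
    """Ask the vendor for the latest version and download it if it is newer.

    Returns the database version installed after the check.
    """
    latest = fetch_latest_version()
    if latest > local_version:
        download_update(latest)
        return latest
    return local_version
```

A periodic scheduler and a manual "update now" action could both call the same function, which matches the description of automatic polling with a manual override.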

  • Should updates be manual or automatic?

Automatic updates have the potential to provide current known-virus protection for users as soon as possible. Currency aside, some machines are not aggressively maintained by their users, and automatic updates keep such machines protected without any action on the user's part. Automatic updates are not always the best choice, however.

Anti-virus software, like any software, can have bugs. It is rare, but possible, for a database update to cause substantial headaches for users because of this. In one case, a buggy update caused the networks of some Japanese railway, subway, and media organizations to be inaccessible for hours.

  • How often should updates be done?

Frequency of updates is in part a reflection of the rate at which new threats appear. Once upon a time, monthly updates would have been sufficient; now, weekly and daily updates may not be often enough.

  • How should updates be distributed?

Electronic distribution of updates, especially via the Internet, is the only viable means to disseminate frequent updates. This means that anti-virus vendors must have infrastructures for distributing updates that are able to withstand heavy load, because a highly publicized threat may cause many users to update at the same time.

The update process is an attractive target for attackers. It is something that is done often by users, and compromising updates would create a huge pool of vulnerable machines. The compromise may occur in a number of ways:

  • The vendor's machines that distribute the update may be attacked.
  • An update may be compromised at the vendor before reaching the distribution machines. Anti-virus vendors are amply protected internally from malware, but an inside threat is always possible.
  • A user machine may be spoofed, so that it connects to an attacker's machine instead of the vendor's machines.
  • A "man-in-the-middle" attack may be mounted, where an attacker is able to intercept communications between the user and vendor. An attacker may modify the real update, or inject their own update into the communications channel.

There is also the practical matter of what form the update will take. Transmitting a fresh copy of the entire virus database is not feasible due to the bandwidth demands it would place on the vendor's update infrastructure, not to mention the comparatively limited bandwidth that many users have.

The virus database will have a relatively small number of changes between updates, so instead of sending the entire database, a vendor can just send the changes to the database. These changes are sometimes called deltas.

Furthermore, these deltas can be compressed to make them smaller still. Downloaded deltas should be verified to protect against attacks and transmission errors.
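The delta, compression, and verification steps can be sketched together as follows. This is a sketch under stated assumptions: the database is modeled as a plain dictionary, the delta format is invented, and verification uses an HMAC with a hypothetical shared key purely to keep the example within the standard library. A public-key signature would be the more usual choice for software updates.

```python
import hashlib
import hmac
import json
import zlib

SHARED_KEY = b"hypothetical-key"  # assumption: stands in for a real signing scheme


def make_delta(old: dict, new: dict) -> bytes:
    """Vendor side: describe only what changed, then compress it."""
    delta = {
        "set": {k: v for k, v in new.items() if k not in old or old[k] != v},
        "delete": [k for k in old if k not in new],
    }
    return zlib.compress(json.dumps(delta, sort_keys=True).encode("utf-8"))


def sign(payload: bytes) -> bytes:
    """Compute an authentication tag over the compressed delta."""
    return hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()


def apply_delta(db: dict, payload: bytes, tag: bytes) -> dict:
    """User side: verify first, then decompress and apply the changes."""
    if not hmac.compare_digest(sign(payload), tag):
        raise ValueError("update failed verification")
    delta = json.loads(zlib.decompress(payload).decode("utf-8"))
    updated = dict(db)
    updated.update(delta["set"])
    for key in delta["delete"]:
        updated.pop(key, None)
    return updated
```

Verifying before decompressing means that a corrupted or tampered download is rejected before any of its contents are interpreted.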

The update mechanism can also be used to update the anti-virus engine itself, not just the virus database. This may be necessary to fix bugs, or add functionality required to detect new viruses. Known-virus scanners will need their data structures updated with the latest signatures as well.

Clearly, the information in the virus database and other updates from an anti-virus vendor must come from somewhere. Anti-virus vendors often have an in-house virus description language, a domain-specific language designed to describe viruses, and how to detect, verify, and disinfect each one.

Anti-virus researchers write virus descriptions in this language, and a compiler for the virus description language translates them into the virus database format. Domain-specific languages tend to be very good at describing things in their domain, but not very good for general use.
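The compilation step can be pictured with a toy example. The description syntax below (`virus`, `signature`, `disinfect`, `end`) is entirely made up for illustration and does not correspond to any vendor's language; the "database format" is simply a Python dictionary.

```python
from typing import Dict, Optional

EXAMPLE_SOURCE = """
# A description in a hypothetical virus description language.
virus Example.A
  signature de ad be ef
  disinfect restore-header
end
"""


def compile_descriptions(source: str) -> Dict[str, dict]:
    """Translate textual virus descriptions into database records."""
    database: Dict[str, dict] = {}
    current: Optional[dict] = None
    for raw in source.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword, _, rest = line.partition(" ")
        rest = rest.strip()
        if keyword == "virus":
            current = {"signature": b"", "disinfect": ""}
            database[rest] = current
        elif keyword == "end":
            current = None
        elif current is None:
            raise SyntaxError(f"statement outside a virus block: {line!r}")
        elif keyword == "signature":
            current["signature"] = bytes.fromhex(rest)
        elif keyword == "disinfect":
            current["disinfect"] = rest
        else:
            raise SyntaxError(f"unknown keyword: {keyword!r}")
    return database
```

Even this toy shows the trade-off mentioned above: the language states a signature and a disinfection action very compactly, but it can express nothing else.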

Virus description languages can have escape mechanisms to call code written in a general-purpose language, code which is compiled and either interpreted or run natively. This allows special-purpose code to be written for detection, verification, or disinfection.

Special-purpose code can also be used to direct the entire virus detection process, instead of only being invoked when needed. For example, for viruses which have multiple entry points, special-purpose code can tell a scanner what locations it should scan.
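One way to picture such an escape mechanism is a registry that maps a virus name to a function written in a general-purpose language, which the scanner consults to decide where to look. Everything in this sketch is hypothetical, including the registry, the decorator, the virus name, and the file layout assumed by the example function.

```python
from typing import Callable, Dict, List

LocationFinder = Callable[[bytes], List[int]]

# Registry of special-purpose code, keyed by (hypothetical) virus name.
SPECIAL_CODE: Dict[str, LocationFinder] = {}


def special(name: str) -> Callable[[LocationFinder], LocationFinder]:
    """Decorator that registers special-purpose code for one virus."""
    def register(func: LocationFinder) -> LocationFinder:
        SPECIAL_CODE[name] = func
        return func
    return register


@special("Example.Multi")
def multi_entry_locations(data: bytes) -> List[int]:
    """Made-up layout: byte 0 holds a count N, followed by N one-byte offsets."""
    count = data[0] if data else 0
    return list(data[1:1 + count])


def scan_locations(virus_name: str, data: bytes) -> List[int]:
    """Ask special-purpose code where to scan; default to offset 0 otherwise."""
    finder = SPECIAL_CODE.get(virus_name)
    return finder(data) if finder else [0]
```

For input `bytes([2, 10, 20, 99])`, the registered function reports offsets 10 and 20, while a virus without special-purpose code falls back to the single default location.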