Checksums

Gosu provides an API for generating checksums. A checksum is a calculated number to detect modification of data in transit. Such modification can be either accidental or malicious. For example, you can use a checksum to detect corrupted stored data or errors in a communication channel. Longer checksums such as 64-bit checksums are also known as fingerprints.

To improve detection of accidental modification of data in transit, you can use checksums. A checksum is a computed value generated from an arbitrary block of digital source data. To check the integrity of the data at a later time, recompute the checksum and compare it with the stored checksum. If the checksums do not match, the data was altered either intentionally or unintentionally. For example, this technique can help detection of physical data corruption or errors in a communication channel.

Be aware that checksums cannot perfectly protect against intentional corruption by a malicious agent. A malicious attacker could modify the data in a way that preserves its checksum value or, depending on transport integrity, substitute a new checksum. To guard against malicious changes, use encryption at the data level with a cryptographic hash or the transport level, such as SSL/HTTPS.

Warning: Checksums improve detection from accidental modification of data but cannot detect all intentional corruption by a malicious agent. If you need that level of protection, use encryption instead of checksums, or in addition to checksums.

You can also use fingerprints to design caching and synchronizing algorithms that check whether data changed since the last cached copy. You can save the fingerprint of the cached copy and an external system can generate a fingerprint of its most current data. If you have both fingerprints, compare them to determine if you must resynchronize the data. To work effectively, the fingerprint algorithm must provide near certainty that a real-world change would change the fingerprint. Having two different values generate a matching fingerprint is called a collision. A fingerprint uniquely identifies the data for most practical purposes, although changed data causing a collision is theoretically possible.

Gosu provides support for 64-bit checksums in the class FP64 in the package gw.util.fingerprint.

The FP64 class provides methods for computing 64-bit fingerprints of the following kinds of data:

String objects
Character arrays
Byte arrays
Input streams

Fingerprints provide a probabilistic guarantee that defines a mathematical upper bound on the probability of a collision. A collision occurs if two different strings have the same fingerprint. Using 64-bit fingerprints, the odds of a collision are extremely small. The odds of a collision between two randomly chosen texts a million characters long are less than 1 in a trillion.

Suppose you have a set S of n distinct strings each of which is at most m characters long. The odds of any two different strings in S having the same fingerprint is described by the following equation where k is the number of bits in the fingerprint:

(nm^2) / 2^k

For practical purposes, you can treat fingerprints as uniquely identifying the bytes that produced them. The following mathematical notation demonstrates this assumption, given two String variables s1 and s2, using the → symbol to mean “implies”:

new FP64(s1).equals(new FP64(s2)) → s1.equals(s2)

Do not fingerprint the value of a fingerprint by fingerprinting the output of the FP64 methods toBytes and toHexString. Using this type of fingerprint might lead to unexpected collisions. The shorter length of the fingerprint itself invalidates the probabilistic guarantee that a collision is unlikely to occur.