Random Numbers

James Adrian
      Many individuals, businesses and other organizations have a legitimate and lawful need to keep some of their information confidential or even absolutely secret from competitors, the press and others. Because email has become vital to our work and because email is not secure, encryption is often necessary. Law-abiding people cannot be confused with criminals passing illicit secrets if they keep their messages and comply with lawful authority on those rare occasions when investigations of their activities are warranted.

      To create a message that only the sender and the intended receivers can understand, the message must be encrypted by means of a method that relies upon information that cannot be discovered by any examination of the transmission. If the encryption relies upon complexity or computational difficulty without involving secret information, the intended message can be discovered by third parties. To third parties, some aspect of the transmission must be unknowable.

      Within encryption technology, the embodiment of information unknowable to third parties often takes the form of random numbers kept by the sender and receivers and not known to anybody else. These random numbers are used to encrypt messages in such a way as to make decryption of a transmission impossible without access to those same random numbers. Random numbers are also needed in scientific experiments to ensure the absence of bias.

      Truly random numbers have no pattern in them. A string of random bits is analogous to noise. In the jargon of signal processing, noise can be white, meaning that it is very random, or it can be pink, meaning that it contains detectable regularities. It is within this context that von Neumann whitening came to be known. Invented by John von Neumann (1903 - 1957), it is a way of creating a string of bits that largely lacks orderly patterns that may exist in the source string. The method requires source two strings to be of the same length as each other and to be unrelated to each other. Here, a bit from string A and a bit from string B are used to determine the output according to this table:

	A	B	Output
	0	0	None
	0	1	1
	1	0	0
	1	1	None
      The resultant output string is more random than the two input strings but shorter. Two input strings, each 1000 bits in length, will produce and output sting that is about 250 bits in length (a four to one reduction in data).

      The von Neumann whitening algorithm points to a paradigm that is not often, if ever, used by writers of random number software. The usual approach is to generate long and complex strings of bits from an algorithm together with a short "seed" of numbers. This approach does not produce cryptographically strong random numbers and consequently their output is termed pseudo random. They produce a long string from a short string. The von Neumann algorithm works to extract randomness from a long string to generate a shorter one. It is not any traditional kind of compression because it does not provide much hope of retrieving the source data. It is not a hash algorithm because it does not produce an output of a predetermined length; and although a hash does produce a smaller output than its input, a hash is not traditionally designed to produce long and cryptographically strong strings of random bits. The paradigm that the von Neumann whitening algorithm points to warrants a different name. All of this is not to say that von Neumann whitening produces perfectly random strings.

      I concentrate on text files written in English as the first example of files from which randomness can be extracted. The randomness that can be extracted from large text files comes mainly from two sources: Firstly, it comes from the somewhat arbitrary choices (of which there are many) that writers make in deciding how to express themselves. Secondly, pairs of such files (or pairs of sections of such files) can be compared to reveal thoroughly coincidental digital features that appear randomly as two independently written strings of such data are examined together.

      There are a number of tools other than the von Neumann whitening algorithm that can be useful in extracting this randomness. A prominent one is the logical operation known as the Exclusive OR logical operation. It is built into every computer:

	A	B	Output
	0	0	0
	0	1	1
	1	0	1
	1	1	0
      The Exclusive OR value is 1 whenever the two inputs are different and it is 0 whenever the two inputs are the same. It says "yes" or "1" when either input is a 1 but says "no" or "0" if they are both 1. It accepts either 1 but not both - thus the name Exclusive OR. This logical operation reduces the data from two strings of a given size to one string of that same size. The output has half as much data as the input.

      Filtering out text file data that is not at all like a succession of thoughtfully constructed sentences can be done by deleting identical triplets of characters. Long strings of blanks, tabs or carriage returns (end of line characters) often present for formatting reasons, would be limited to two in a row.

      There is also a motive to exclude the most significant four bits of each byte because the least significant four bits are simply more evenly divided among its 16 possible numbers. This is because the one most significant bit in a byte is often constant, as in ASCII text which requires only 128 (not 256) characters. This would produce output data that is half the length of the input data.

      Applying each of these measures once would result in a data reduction factor of 16 with a further reduction due to excluding byte triples. A test which reduces 16 megabytes of text to somewhat less than one megabyte of output data and examines the statistical characteristics of the output would be a valuable first step.

      The von Neumann whitening and the Exclusive OR operation could be applied multiple times. If the Exclusive OR were used on two files to create a third, and then on another file with the third, the resultant file in each bit position would have a bit that was the parity (odd or even) of the bits in the same bit position in the source files. No matter how many source files are used this way, the only information that remains in the output file is the parity for each bit position.

      These methods can be used on source files other than text files. Picture files, compressed audio files of music, and audio files of noise are examples of workable source files.

      It would be interesting to see if using these techniques produces statistical results that stop getting more random at some ratio of input data to output data. If so, this would be a way to go beyond pseudo-random software. It might even be a way to go beyond the tiniest imperfections of hardware random number generators (which are always present) by using many different hardware random number generators to produce separate source files.

      I believe that at some ratio of input data to output data, even text files processed in this way will produce statistically random bits as output; however, if the input data has a high degree of structure (such as sentences) the ratio of the amount of input data versus the amount of output data might need to be large. To moderate this input to output ratio, files might be compressed before these whitening measures are taken.

      The source files for the methods described above could be obtained from audio noise instead of text or other inherently orderly files. Audio noise can be recorded secretly and can be reduced to binary files indicating that the sound pressure is above (1) or below (0) a midpoint. The imprecision of the midpoint can be corrected by concatenating sizable sections that happen to have as many ones as zeros; then, apply the methods described above.

      If your data absolutely must remain confidential, don't trust any random number generator that you can buy.