Future Beacon

Random Numbers from Text and Other Data

by James Adrian

      Many individuals, businesses and other organizations have a legitimate and lawful need to keep some of their information confidential or even absolutely secret from competitors, the press and others. Because email has become vital to our work and because email is not secure, encryption is often necessary. Law-abiding people cannot be confused with criminals passing illicit secrets if they keep their messages and comply with lawful authority on those rare occasions when investigations of their activities are warranted.

      To create a message that only the sender and the intended receivers can understand, the message must be encrypted by means of a method that relies upon information that cannot be discovered by any examination of the transmission. If the encryption relies upon complexity or computational difficulty without involving secret information, the intended message can be discovered by third parties. To third parties, some aspect of the transmission must be unknowable.

      Within encryption technology, the embodiment of information unknowable to third parties often takes the form of random numbers kept by the sender and receivers and not known to anybody else. These random numbers are used to encrypt messages in such a way as to make decryption of a transmission impossible without access to those same random numbers. Random numbers are also needed in scientific experiments to ensure the absence of bias.

      Truly random numbers have no pattern in them. A string of random bits is analogous to noise. In the jargon of signal processing, noise can be white, meaning that it is very random, or it can be pink, meaning that it contains detectable regularities. It is within this context that von Neumann whitening came to be known. Invented by John von Neumann (1903 - 1957), it is a way of creating a string of bits that largely lacks orderly patterns that may exist in the source strings. The method requires the source strings to be of the same length and to be unrelated to each other. Here, a bit from string A and a bit from string B are used to determine the output according to this table:
	A	B	Output
	0	0	None
	0	1	1
	1	0	0
	1	1	None
      The resultant output string is more random than the input strings but shorter. Two input strings, each 1000 bits in length, will produce and output sting that is about 250 bits in length (a four to one reduction in data).

      The von Neumann whitening algorithm points to a paradigm that is not often if ever used by writers of random number software. The usual approach is to generate long and complex strings of bits from an algorithm together with a short "seed" of numbers. This approach does not produce cryptographically strong random numbers and consequently their output is termed pseudo random. They produce a long string from a short string. The von Neumann algorithm works to extract randomness from a long string to generate a shorter one. It is not any traditional kind of compression because it does not provide any hope of retrieving the source data. It is not a hash algorithm because it does not produce an output of a predetermined length; and although a hash does produce a smaller output than its input, a hash is not traditionally designed to produce long and cryptographically strong strings of random bits. The paradigm that the von Neumann whitening algorithm points to warrants a different name.

      I concentrate on text files written in English as the first example of files from which randomness can be extracted. They are plentiful and fruitful. The randomness that can be extracted from large text files comes mainly from two sources: Firstly, it comes from the somewhat arbitrary choices (of which there are many) that writers make in deciding how to express themselves. Secondly, pairs of such files (or pairs of sections of such files) can be compared to reveal thoroughly coincidental digital features that appear randomly as two independently written strings of such data are examined together.

      There are a number of tools other than the von Neumann whitening algorithm that can be useful in extracting this randomness. A prominent one is the logical operation known as the Exclusive OR logical operation. It is built into every computer:
	A	B	Output
	0	0	0
	0	1	1
	1	0	1
	1	1	0
      The Exclusive OR value is 1 whenever the two inputs are different and it is 0 whenever the two inputs are the same. It says "yes" or "1" when either input is a 1 but says "no" or "0" if they are both 1. It accepts either 1 but not both - thus the name Exclusive OR. This logical operation reduces the data from two strings of a given size to one string of that same size. The output has half as much data as the input.

      Filtering out text file data that is not at all like a succession of thoughtfully constructed sentences can be done by deleting identical triplets of characters. Long strings of blanks, tabs or carriage returns (end of line characters) often present for formatting reasons, would be limited to two in a row.

      There is also a motive to exclude the most significant four bits of each byte because the least significant four bits are simply more evenly divided among its 16 possible numbers. This is because the one most significant bit in a byte is often constant, as in ASCII text which requires only 128 (not 256) characters. This would produce output data that is half the length of the input data.

      Applying each of these measures once would result in a data reduction factor of 16 with a further reduction due to excluding byte triples. A test which reduces 16 megabytes of text to somewhat less than one megabyte of output data and examines the statistical characteristics of the output would be a valuable first step.

      The von Neumann whitening and the Exclusive OR operation could be applied multiple times. If the Exclusive OR were used on two files to create a third, and then on another file with the third, the resultant file in each bit position would have a bit that was the parity (odd or even) of the bits in the same bit position in the source files. No matter how many source files are used this way, the only information that remains in the output file is the parity for each bit position.

      These methods can be used on source files other than text files. Picture files, compressed audio files of music, and audio files of noise are examples of workable source files.

      It would be interesting to see if using these techniques produces statistical results that stop getting more random at some ratio of input data to output data. If so, this would be a way to go beyond pseudo-random software. It might even be a way to go beyond the tiniest imperfections of hardware random number generators (which are always present) by using many different hardware random number generators to produce separate source files.

      I believe that at some ratio of input data to output data, even text files processed in this way will produce statistically random bits as output; however, if the input data has a high degree of structure (such as sentences) the ratio of the amount of input data versus the amount of output data might need to be large. To moderate this input to output ratio, files might be compressed before these whitening measures are taken.

      The following algorithm is intended to transform text into a string of bytes that include all 256 possible bytes (equally distributed) while removing the regularities of correct spelling. In the text, a "q" is always followed by a "u" but in the output array, "q" might be followed by any letter or, in fact, any byte.

N = a power of 2 greater than 1000       A megabyte might be typical.
i = 0
j = 0
k = a number between 10 and N/2 that is not divisible by 2
A, B, C, D and OUT are arrays of size N. Ai is the ith byte in array A
Save = 0
      This is a byte that does not need to be initialized.

While j < N - 1
      Save = (Ai + Bi + Ci + Di) MOD 256
      i = (i + k) MOD N
      OUTj = Save
      j = j + 1

      This algorithm produces a more equal distribution of byte values in the output if there are many arrays being summed (there are four in this example) and if the source files are divided into small arrays (1024 bytes as opposed to one megabyte); however, this drastically affects the ratio of input to output. The most profitable balance can found through statistical tests. The following algorithm seeks to balance the frequency of byte values:

i = 0
u = 0
N is the array length
Sum = array0

While i < N
      u = (i+1) MOD N
      arrayi = (Sum + arrayi + arrayu) MOD 256
      IF arrayi = 0, THEN (arrayi = arrayi + 37) MOD 256
      Sum = (Sum + arrayi + i) MOD N
      i = i + 1

      If the number of input arrays is even, pairs of them can be preprocessed by the von Neumann algorithm as well.

      The bits of a sequentially constructed file (such as a text file) has many relationships that are not controlled or intended. One of these types is between bits separated by a great distance within the file. It is hard to imagine a strong correlation between such pairs of single bits. If by any algorithm that uses data in the file itself, bit offset addresses can be constructed, the succession of bits pointed to by the successive addition of such bit addresses could be used to create a new file in which each successive bit has little to do with its neighbors. This example produces an output array that contains approximately one quarter of the data size of the input:

N = a power of 2 greater than 4000      This N is a number of bits, not bytes. A megabit might be typical.
A = a bit array of length N
B = a bit array of length N
L = minimum number of bits required to represent number N
j = 0
      This is the bit offset into array A.
k = 2000      This is the bit offset into array B.
Z = destination bit array
h = 0
      This is the bit offset into array Z (the output array).
LbitNumber = an L bit number not initialized

While h < N
      LbitNumber = Ak, A(k+1)...A(k+(L-1))
      Each subscript calculated MOD N
      j = (j + LbitNumber) MOD N
      LbitNumber = Bj, B(j+1)...B(j+(L-1))
      Each subscript calculated MOD N
      k = (k + LbitNumber) MOD N
      IF A(k) = B(j), continue
      If equal, continue the loop from the top.
      Z(h) = B(j)
      h = h + 1

      The usefulness of whitening processes is proved by statistical tests of randomness. The virtually unused paradigm of extracting randomness from large files rather than generating pseudo-random numbers from small seeds is the reason for suspecting that these processes have value.


      Please feel free to write to me directly for more information or to make suggestions or comments. My email address is jim@futurebeacon.com. You can also go to my contact page to get my full contact information. Suggestions, questions, additional information and critiques are very welcome.