Random Numbers from Text and Other Data
by James Adrian
Many individuals, businesses and other
organizations have a legitimate and lawful need to keep some of their information confidential or even absolutely secret from competitors,
the press and others. Because email has become vital to our work and because email is not secure, encryption is often necessary. Law-abiding people
cannot be confused with criminals passing illicit secrets if they retain their messages and comply with lawful authority on those rare occasions when
investigations of their activities are warranted.
To create a message that only the sender and the
intended receivers can understand, the message must be encrypted by means of a method that relies upon information that cannot be discovered by any
examination of the transmission. If the encryption relies upon complexity or computational difficulty without involving secret information, the message can
eventually be discovered by third parties. To third parties, some aspect of the transmission must be unknowable.
Within encryption technology, the embodiment of
information unknowable to third parties often takes the form of random numbers kept by the sender and receivers and not known to anybody else. These
random numbers are used to encrypt messages in such a way as to make decryption of a transmission impossible without access to those same random
numbers. Random numbers are also needed in scientific experiments to ensure the absence of bias.
Truly random numbers have no pattern in them.
A string of random bits is analogous to noise. In the jargon of signal processing, noise can be white
, meaning that it is very random, or colored, meaning that it contains detectable regularities. It is within this context that von Neumann whitening
came to be known. Invented
by John von Neumann (1903 - 1957), it is a way of creating a string of bits that largely lacks orderly patterns that may exist in the source strings. The
method requires the source strings to be of the same length and to be unrelated to each other. Here, a bit from string A and a bit from string B are used
to determine the output according to this table:
A B Output
0 0 None
0 1 1
1 0 0
1 1 None
The resultant output string is more random than
the input strings but shorter. Two input strings, each 1000 bits in length, will produce an output string that is about 500 bits in length (a four-to-one
reduction in data).
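As an illustration (my own sketch, not code from this article), the table above can be rendered in a few lines of Python:

```python
def von_neumann_whiten(a: str, b: str) -> str:
    # a and b are equal-length strings of '0'/'1' bits taken from
    # two unrelated sources.  Per the table: 01 -> '1', 10 -> '0',
    # and the equal pairs 00 and 11 produce no output at all.
    out = []
    for x, y in zip(a, b):
        if x != y:
            out.append('1' if x == '0' else '0')
    return ''.join(out)
```

On unbiased input, roughly half of the bit pairs differ, so the output is about one quarter the size of the two inputs combined.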
The von Neumann whitening algorithm points to
a paradigm that is not often if ever used by writers of random number software. The usual approach is to generate long and complex strings of bits from
an algorithm together with a short "seed" of numbers. This approach does not produce cryptographically strong random numbers, and consequently its
output is termed pseudo random: it produces a long string from a short string. The von Neumann algorithm works in the opposite direction, extracting randomness
from a long string to generate a shorter one. It is not any traditional kind of compression because it does not provide any hope of retrieving the source
data. It is not a hash algorithm because it does not produce an output of a predetermined length; and although a hash does produce a smaller output than
its input, a hash is not traditionally designed to produce long and cryptographically strong strings of random bits. The paradigm that the von Neumann
whitening algorithm points to warrants a different name.
I concentrate on text files written in English as the
first example of files from which randomness can be extracted. They are plentiful and fruitful. The randomness that can be extracted from large text
files comes mainly from two sources: Firstly, it comes from the somewhat arbitrary choices (of which there are many) that writers make in deciding how
to express themselves. Secondly, pairs of such files (or pairs of sections of such files) can be compared to reveal thoroughly coincidental digital features
that appear randomly as two independently written strings of such data are examined together.
There are a number of tools other than the von
Neumann whitening algorithm that can be useful in extracting this randomness. A prominent one is the logical operation known as Exclusive OR (XOR).
It is built into every computer:
A B Output
0 0 0
0 1 1
1 0 1
1 1 0
The Exclusive OR value is 1 whenever the two
inputs are different and it is 0 whenever the two inputs are the same. It says "yes" (1) when either input is 1, but "no" (0) when both are 1.
It accepts either 1 but not both, hence the name Exclusive OR.
This logical operation reduces the data from two strings of a given size to one
string of that same size. The output has half as much data as the input.
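This operation can be sketched in Python as follows (the helper name xor_bytes is my own):

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Combine two equal-length byte strings bit for bit with
    # Exclusive OR; the result is the same length as either input.
    return bytes(x ^ y for x, y in zip(a, b))
```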
Filtering out text file data that is not at all like a
succession of thoughtfully constructed sentences can be done by deleting identical triplets of characters. Long runs of blanks, tabs or carriage returns
(end-of-line characters), often present for formatting reasons, would be limited to two in a row.
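One way to sketch this filter in Python (the run limit of two comes from the paragraph above; the function name is my own):

```python
def limit_runs(text: str, max_run: int = 2) -> str:
    # Drop any character that would extend a run of identical
    # characters beyond max_run, limiting runs to two in a row.
    out = []
    for ch in text:
        if len(out) >= max_run and all(c == ch for c in out[-max_run:]):
            continue  # skip a third (or later) repeat
        out.append(ch)
    return ''.join(out)
```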
There is also a motive to exclude the most
significant four bits of each byte because the least significant four bits are more evenly distributed among their 16 possible values. This is because
the most significant bit in a byte is often constant, as in ASCII text, which requires only 128 (not 256) characters. This step alone would produce output data
that is half the length of the input data.
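Discarding the most significant four bits could look like this in Python (a sketch; packing two consecutive low nibbles into one output byte is my own choice of packing):

```python
def low_nibbles(data: bytes) -> bytes:
    # Keep only the low four bits of each byte and pack two such
    # nibbles into each output byte, halving the data size.
    out = bytearray()
    for i in range(0, len(data) - 1, 2):
        out.append(((data[i] & 0x0F) << 4) | (data[i + 1] & 0x0F))
    return bytes(out)
```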
Applying each of these measures once would result
in a data reduction factor of 16, with a further reduction due to excluding repeated triples. A test that reduces 16 megabytes of text to somewhat less than
one megabyte of output data and examines the statistical characteristics of the output would be a valuable first step.
The von Neumann whitening and the Exclusive
OR operation could be applied multiple times. If the Exclusive OR were used on two files to create a third, and then on another file with the third, the
resultant file in each bit position would have a bit that was the parity (odd or even) of the bits in the same bit position in the source files. No matter how
many source files are used this way, the only information that remains in the output file is the parity for each bit position.
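Folding any number of equal-length files down to one parity file can be sketched as follows (my own illustration):

```python
from functools import reduce

def parity_fold(files):
    # XOR any number of equal-length byte strings together; each
    # output bit is the parity of the corresponding input bits.
    return bytes(reduce(lambda acc, x: acc ^ x, col) for col in zip(*files))
```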
These methods can be used on source files other than
text files. Picture files, compressed audio files of music, and audio files of noise are examples of workable source files.
It would be interesting to see if using these
techniques produces statistical results that stop getting more random at some ratio of input data to output data. If so, this would be a way to go beyond
pseudo-random software. It might even be a way to go beyond the tiniest imperfections of hardware random number generators (which are always
present) by using many different hardware random number generators to produce separate source files.
I believe that at some ratio of input data to output
data, even text files processed in this way will produce statistically random bits as output; however, if the input data has a high degree of structure (such
as sentences) the ratio of the amount of input data versus the amount of output data might need to be large. To moderate this input to output ratio, files
might be compressed before these whitening measures are taken.
The following algorithm is intended to transform
text into a string of bytes that include all 256 possible bytes (equally distributed) while removing the regularities of correct spelling. In the text, a "q" is
always followed by a "u" but in the output array, "q" might be followed by any letter or, in fact, any byte.
N = a power of 2 greater than 1000
A megabyte might be typical.
i = 0
j = 0
k = a number between 10 and N/2 that is not divisible by 2
A, B, C, D and OUT are arrays of size N. A(i) is the ith byte in array A
Save = 0
This is a byte that does not need to be initialized.
While j < N - 1
    Save = (A(i) + B(i) + C(i) + D(i)) MOD 256
    i = (i + k) MOD N
    OUT(j) = Save
    j = j + 1
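A direct Python rendering of the steps above (a sketch; tiny arrays are used in the tests only for illustration, where the text suggests sizes near a megabyte):

```python
def mix_arrays(a, b, c, d, k):
    # N (the array length) should be a power of 2 and k odd, so that
    # stepping i by k visits every index once before repeating.
    n = len(a)
    out = bytearray()
    i = 0
    for _ in range(n - 1):  # the pseudocode loops while j < N - 1
        out.append((a[i] + b[i] + c[i] + d[i]) % 256)
        i = (i + k) % n
    return bytes(out)
```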
This algorithm produces a more equal distribution
of byte values in the output if there are many arrays being summed (there are four in this example) and if the source files are divided into small arrays
(1024 bytes as opposed to one megabyte); however, this drastically affects the ratio of input to output. The most profitable balance can be found through
statistical tests. The following algorithm seeks to balance the frequency of byte values:
i = 0
u = 0
N is the array length
Sum = array(0)
While i < N
    u = (i + 1) MOD N
    array(i) = (Sum + array(i) + array(u)) MOD 256
    IF array(i) = 0, THEN array(i) = (array(i) + 37) MOD 256
    Sum = (Sum + array(i) + i) MOD N
    i = i + 1
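The same steps in Python, operating in place on a bytearray (a sketch; the constant 37 comes from the algorithm above):

```python
def balance_bytes(arr):
    # One pass that mixes each byte with a running sum and its right
    # neighbor (wrapping around) to even out byte-value frequencies.
    n = len(arr)
    total = arr[0]                       # Sum = array(0)
    for i in range(n):
        u = (i + 1) % n
        arr[i] = (total + arr[i] + arr[u]) % 256
        if arr[i] == 0:
            arr[i] = (arr[i] + 37) % 256  # avoid a surplus of zeros
        total = (total + arr[i] + i) % n
    return arr
```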
If the number of input arrays is even, pairs of them
can be preprocessed by the von Neumann algorithm as well.
The bits of a sequentially constructed file (such as a
text file) have many relationships that are not controlled or intended. One of these types is between bits separated by a great distance within the file. It is
hard to imagine a strong correlation between such pairs of single bits. If by any algorithm that uses data in the file itself, bit offset addresses can be
constructed, the succession of bits pointed to by the successive addition of such bit addresses could be used to create a new file in which each successive
bit has little to do with its neighbors. This example produces an output array that contains approximately one quarter of the data size of the input:
N = a power of 2 greater than 4000
This N is a number of bits, not bytes. A megabit might be typical.
A = a bit array of length N
B = a bit array of length N
L = minimum number of bits required to represent number N
j = 0
This is the bit offset into array A.
k = 2000
This is the bit offset into array B.
Z = destination bit array
h = 0
This is the bit offset into array Z (the output array).
LbitNumber = an L-bit number, not initialized
While h < N
    LbitNumber = A(k), A(k+1) ... A(k+(L-1))
    Each subscript calculated MOD N
    j = (j + LbitNumber) MOD N
    LbitNumber = B(j), B(j+1) ... B(j+(L-1))
    Each subscript calculated MOD N
    k = (k + LbitNumber) MOD N
    IF A(k) = B(j), continue
    If equal, continue the loop from the top.
    Z(h) = B(j)
    h = h + 1
The usefulness of whitening processes is demonstrated by
statistical tests of randomness. The virtually unused paradigm of extracting randomness from large files, rather than generating pseudo-random numbers
from small seeds is the reason for suspecting that these processes have value.
Please feel free to write to me directly for more information or to make suggestions or comments. You can also go to my contact page for my full
contact information. Suggestions, questions, additional information and critiques are very welcome.