Saturday, August 5, 2006

Machine Translation: Library format

The Gistmass ELF (Extensible Linguistic Framework) Library contains converted text files of well-known prose, including nursery rhymes, famous quotes, essays, articles, books, and the like. If somebody says, ‘Four score and seven years ago,’ it registers an exact hit. But if somebody says, ‘Five score and eleven years ago,’ it senses a similarity but not an exact hit. It sounds a little odd both to us and to the Gistmass ELF Machine Translation program.

The format of Library files is encoded. Library files are split in two, with a data fork and a format fork. The data fork is filled with sequences of root word indexes. For instance, a verb has several inflected forms, like ‘ran,’ ‘run,’ and ‘running,’ but they’re all the same root word.

Root word index numbers help speed up searches. Two groups of words can be converted into bit arrays: one bit per index. A bitwise AND operation compares two bit arrays for coinciding bits. A zero result means total independence. A non-zero result means coincidences exist, because the same index bit is ‘on’ in both arrays, which indicates that the same word belongs to both groups. The index of every resultant ‘on’ bit reveals the shared words. Here’s a truth table, where ‘1’ means ‘on’ and ‘0’ means ‘off.’

bit   0 1 0 1
wise  0 0 1 1
AND   0 0 0 1

This is the bitwise AND operation; only when both bits are ‘on’ does the result bit come out ‘on,’ as in the last column. Bring up a calculator program and enter 2, then click the square root function; copy that number, and do the same for the number 3. Here is what we get.

‘√2 = 1.4142135623730950488016887242097’
‘√3 = 1.7320508075688772935274463415059’

Let’s use these numbers to generate two random sequences of bits, changing every digit into a single bit. Let’s set every digit from zero to four ‘off’ and every digit from five to nine ‘on.’ Using this rubric, we’ll convert the two square roots into two aligned bit arrays, and draw a line under the second array. Now let's perform the bitwise AND operation. Just mark the ‘on’ bits using the table as a guide.

00000001100100110011001111000011
01000101011111110101010010001011
________________________________
00000001000100110001000010000011

The bitwise AND is one of the fastest operations in a microprocessor. Let’s convert these bit arrays into index numbers, with one on the left and thirty-two on the right. Here are two sequences of numbers, which represent the bit indexes of the ‘on’ bits:

‘8, 9, 12, 15, 16, 19, 20, 23, 24, 25, 26, 31, 32’
‘2, 6, 8, 10, 11, 12, 13, 14, 15, 16, 18, 20, 22, 25, 29, 31, 32’

Now, let’s write down the numbers that are common to both lists. ‘8, 12, 15, 16, 20, 25, 31, 32’ Was that faster or slower? It was slower. What if I shuffle the numbers out of sequence? Will that be faster or slower? It will be even slower, and the same is true for computers. Do you have any appreciation for the speed of various algorithms? Does the term ‘order of complexity’ ring a bell?
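As a sketch, the whole comparison above collapses to a few lines of Python; the bit strings are the two arrays derived earlier from the digits of √2 and √3 (digits 0–4 ‘off,’ digits 5–9 ‘on’):

```python
# The two bit arrays derived from the digits of the square roots of 2 and 3.
a = int("00000001100100110011001111000011", 2)
b = int("01000101011111110101010010001011", 2)

# One bitwise AND finds every coinciding 'on' bit at once.
common = a & b

# Recover the shared indexes, numbering the bits 1..32 from the left.
shared = [i + 1 for i, bit in enumerate(format(common, "032b")) if bit == "1"]
print(shared)   # [8, 12, 15, 16, 20, 25, 31, 32]
```

The single AND replaces the word-by-word comparison of the two number lists, which is why the bit-array form wins.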

Order of Complexity

C   N        log N   N log N   N²
C   10       1       10        100
C   100      2       200       10,000
C   1,000    3       3,000     1,000,000
C   10,000   4       40,000    100,000,000

These are common ‘orders of complexity’ where ‘N’ is the number of items under consideration. The first column is called ‘constant’ because the length of time it takes to perform the algorithm is independent of the number ‘N.’ For example, a thermostat decides whether or not to turn on the air conditioning in a fixed amount of time. Temperatures cover a wide range, and heat sources vary, but the thermostat only determines whether the temperature is above or below some preset reference temperature, and then acts accordingly.

The second column is called ‘linear’ because the length of time it takes to perform an algorithm is directly proportional to the number ‘N.’ For example, contractors get paid by the hour. The hourly rates for different people may vary, but pay is directly proportional to hours worked. The third column, ‘log N,’ increases much more slowly than N. For example, this is the approach used to look up words in an alphabetized listing, like a dictionary or a telephone book. The fourth column, ‘N log N,’ gets progressively slower as ‘N’ increases. For example, counting out loud, where the larger the number the longer it takes to say all the digits, like saying one million, forty-eight thousand, five hundred seventy-six compared to saying six.
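The rows of that table can be regenerated in a few lines; note the ‘log N’ column uses base-ten logarithms, which only changes the numbers by a constant factor compared to the base-two logs of binary search:

```python
import math

# Reproduce the 'order of complexity' table rows for N = 10 .. 10,000.
rows = []
for n in (10, 100, 1_000, 10_000):
    log_n = round(math.log10(n))          # base-ten log, as in the table
    rows.append((n, log_n, n * log_n, n * n))

for n, log_n, n_log_n, n_sq in rows:
    print(f"C  {n:>6,}  {log_n}  {n_log_n:>7,}  {n_sq:>11,}")
```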

The last column, ‘N’ squared, slows down in a hurry as ‘N’ increases. Let’s consider the monitor configuration utility. The application adjusts the resolution of the display. If we double the resolution, then the number of pixels goes up by a factor of four, or two squared. If we triple the resolution, then the number of pixels goes up by a factor of nine, or three squared.

Now, suppose we want to find phrases in our Library. Each search phrase is independent of all the others, and the size of the Library can increase. This is inherently an ‘N’ squared problem, but we want to speed it up as much as possible. Let’s imagine we’re searching for the phrase, ‘Four score and seven years ago.’ This string has thirty characters, including spaces, but only six words. The Library contains thousands of files. Just one pass through every file takes minutes. Instead, we need to quickly identify those files with the greatest potential of containing this phrase. If we convert this phrase into root word indexes, then the capital F is redundant and so is the plural of ‘year.’ We then have six root word indexes that offset into a dictionary or into a bit array to detect coinciding bits. Our maximum index is sixty-four K, which fits into sixteen bits, or half of a thirty-two-bit word.
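A toy sketch of that normalization step follows; the miniature dictionary, its index values, and the single plural rule are all invented for illustration, since the real Dictionaries are far larger and the actual encoding lives in the format fork:

```python
# Hypothetical miniature root-word dictionary; real indexes run up to 64K.
ROOT_INDEX = {
    "four": 812, "score": 1540, "and": 3,
    "seven": 799, "year": 121, "ago": 977,
}

def phrase_to_root_indexes(phrase):
    """Drop capitalization, and strip a plural 's' when that exposes a known root."""
    indexes = []
    for word in phrase.lower().split():
        if word not in ROOT_INDEX and word.endswith("s"):
            word = word[:-1]            # 'years' -> 'year'
        indexes.append(ROOT_INDEX[word])
    return indexes

print(phrase_to_root_indexes("Four score and seven years ago"))
# six root word indexes: the capital F and the plural are gone
```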

Each index corresponds to a dictionary entry. Separate dictionaries exist for different languages, people, places, animals, plants, fictional characters, product brands, medical terms, acronyms, chemicals, and other specialized disciplines. The most common words appearing in the Library files are assigned the lowest index numbers, while the least common words are assigned the highest index numbers. Offline, we run a histogram for the entire Library to find out how often each word is used. As I said before, each Library file is encoded, and the data fork contains sequences of root word indexes. The format fork contains punctuation, capitalization, special characters, the verb tense in use, and which dictionary is in use. In addition, each disk has a Master file that includes the precompiled bit arrays of every Library file on the disk. For every root word in every Dictionary, a bit is set ‘on’ for all the Library files containing that root word. A bitwise OR operation sets bits ‘on’ in a bit array without erasing any bits already set. Here’s another truth table, where ‘1’ means ‘on’ and ‘0’ means ‘off.’

bit   0 1 0 1
wise  0 0 1 1
OR    0 1 1 1

Let’s revisit the two bit arrays from the bitwise AND example. Now let’s mark the ‘on’ bits using this table, and perform the bitwise OR operation.

00000001100100110011001111000011
01000101011111110101010010001011
________________________________
01000101111111110111011111001011
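This OR accumulation is exactly how the Master file’s per-root-word bit arrays can be built offline. The sketch below assumes a simple in-memory layout (a mapping from file bit position to the set of root word indexes in that file), not the actual on-disk format:

```python
from collections import defaultdict

def build_master(files):
    """files maps a Library file's bit position to the set of root word
    indexes appearing in it.  Returns, per root word, a bit array with
    one bit 'on' for each file containing that root word."""
    master = defaultdict(int)
    for file_bit_position, roots in files.items():
        file_bit = 1 << file_bit_position   # one bit per Library file
        for root in roots:
            master[root] |= file_bit        # OR sets bits without erasing any
    return master

master = build_master({0: {5, 9}, 1: {5}, 2: {7, 9}})
print(bin(master[5]), bin(master[9]))   # 0b11 0b101
```

Reading the result: root word 5 appears in files zero and one, root word 9 in files zero and two.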

So our phrase has six root word indexes, and we want to find all the Library files that contain all six words. The Master file contains a precompiled list, for every root word in every Dictionary, of every Library file containing that root word. We can use it to compare the Library file lists for the six root words in our phrase. Would it be faster to compare the files as numbers or as bit arrays? Bit arrays. Let’s extend our example to six rows. Here are the results of four more square roots, and the corresponding bit arrays.

‘√5 = 2.2360679774997896964091736687313’
‘√6 = 2.4494897427831780981972840747059’
‘√7 = 2.6457513110645905905016157536393’
‘√8 = 2.8284271247461900976033774484194’

00000001100100110011001111000011
01000101011111110101010010001011
00010111110111111110010101111000
00010111001100110110110100101011
01011100000101101101001011101010
01010010001010100111000110010010
________________________________
00000000000000100000000000000000
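Running this six-way AND in code confirms that a single ‘on’ bit survives:

```python
# The six bit arrays derived from the square roots of 2, 3, 5, 6, 7, and 8.
arrays = [
    "00000001100100110011001111000011",
    "01000101011111110101010010001011",
    "00010111110111111110010101111000",
    "00010111001100110110110100101011",
    "01011100000101101101001011101010",
    "01010010001010100111000110010010",
]

result = int(arrays[0], 2)
for a in arrays[1:]:
    result &= int(a, 2)        # five AND operations in all

print(format(result, "032b"))  # one 'on' bit, fifteenth column from the left
```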

In the computer this involves fetching six words of memory and five AND operations. So now we have a result with a bit set ‘on’ for every file that contains all six words. We need to get from that bit array to the Library file. We have thirty-two bits representing thirty-two files. If we had a corresponding table of thirty-two file references, then we could use the bit index as an index into the file reference table. In this case it’s only one file, but an outcome of many files is possible. Testing every bit is thirty-two operations, but table lookups are faster, just as counting fingers is slower than memorizing ‘five plus five equals ten.’ Let’s group the columns together in groups of eight columns.

If the least significant bit is bit zero, then that bit is bit seventeen. A byte is eight bits. Eight bits can address two hundred and fifty-six unique locations. We can precompute a table offline that translates bits into indexes. Every combination of eight bits is a unique index into the table. Each table entry points to a null terminated string containing numbers from one to eight indicating which bits are ‘on’ for that arrangement of bits. Thirty-two bits contains four bytes. The base pointer into the file reference table needs to be adjusted for every byte of the bit array so that a number from one to eight indexes the correct file reference. Then we open a Library file, with a file reference, and search for the exact sequence of words in the phrase, and we report a hit if an exact match is detected. It’s all accomplished much faster than opening every file on a disk and searching for the phrase anywhere in the file.
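Such a table can be precomputed in one loop. Here the entries are Python lists rather than the null terminated strings described above, but the idea is identical: every possible byte value becomes a unique index into a 256-entry table of ‘on’ bit positions.

```python
# For every possible byte value, precompute which of its eight bits are
# 'on', numbered one through eight starting from the least significant bit.
BITS_ON = [
    [p + 1 for p in range(8) if byte & (1 << p)]
    for byte in range(256)
]

# 0b00010010 has bits two and five 'on': one lookup replaces eight bit tests.
print(BITS_ON[0b00010010])   # [2, 5]
```

With the table in hand, a thirty-two-bit result is handled as four byte lookups, each yielding offsets into the file reference table once the base pointer is adjusted per byte.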

But that only covers thirty-two files. If we double the number of files, we double the number of operations. That’s a linear ‘order of complexity.’ Thirty-two times thirty-two is one thousand twenty-four. A thousand files is not a lot for our application. Unless we have a way to rule out files, that’s thirty-two times the number of operations.


Master File Record

1  ABCDEFGHIJKLMNOPQRSTUVWXYZabcdef
2  ABCDEFGHIJKLMNOP | QRSTUVWXYZabcdef
3  ABCDEFGH | IJKLMNOP | QRSTUVWX | YZabcdef
4  ABCD | EFGH | IJKL | MNOP | QRST | UVWX | YZab | cdef
5  AB | CD | EF | GH | IJ | KL | MN | OP | QR | ST | UV | WX | YZ | ab | cd | ef
6  A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z | a | b | c | d | e | f

What you see here is the format of one record in the Master file. This example shows thirty-two unique letters: capital ‘A’ through ‘Z’ and lower case ‘a’ through ‘f.’ The cells along the bottom row, row ‘6,’ contain bit arrays representing thirty-two files. That’s one thousand twenty-four files in all. Every cell with multiple letters combines the bit arrays using a bitwise OR. Notice the top row. Row ‘1’ contains all the letters in one cell. That means all the bit arrays referencing every Library file are combined using a bitwise OR. Each cell represents one word in the Master File Record. The total number of cells, and words in a Master File Record, is one less than double the number of cells in the bottom row. A Master File Record exists for every root word in every Dictionary.

Notice the number of cells in consecutive rows doubles. The next row would contain sixty-four cells. The row after that one hundred twenty-eight, and so forth depending on the number of Library files on the disk. But let’s stick with this example of thirty-two cells. Suppose one of the words we’re interested in only occurs in lower case letter cells. How many cells is that? It’s fourteen.

That’s less than thirty-two. That means only cells with lower case letters have bits set ‘on’ and all other cells are zero. Let’s walk through the steps. To make it easier, let’s call a cell by its row number and the first letter it contains. Cell ‘1A’ is non-zero. From that we know at least one file contains the word we seek. So we move down to row ‘2’ where cell ‘2A’ is zero. We ignore that branch. Half of the Library files are eliminated. We know that cell ‘2Q’ has bits set, without testing it. So we move down to row ‘3’ where cell ‘3Q’ is zero. We ignore that branch. Three quarters of the Library files are now eliminated. We know that cell ‘3Y’ has bits set, without testing it. So we move down to row ‘4’ where both cell ‘4Y’ and cell ‘4c’ are non-zero.

At this point we have two branches to consider, so we need additional help. We perform a bitwise AND test of cell ‘4Y’ using all root words in our phrase. If the result is zero we eliminate that branch, otherwise we push cell ‘4c’ on a stack to consider later, and move down to row ‘5’ where cell ‘5Y’ is zero. We ignore that branch. We know that cell ‘5a’ has the bits set, without testing it. So we move down to row ‘6’ where both cell ‘6a’ and cell ‘6b’ are non-zero. We perform a bitwise AND test of cell ‘6a’ using all root words in our phrase. If the result is zero we eliminate that branch, otherwise we push cell ‘6b’ on the stack. Any non-zero, bitwise AND result on cell ‘6a’ represents one or more Library files that contain all the root words in our phrase. So we go through the process of searching any of those files for an exact match. If we find one, then we post it.

On the way down the rows, we need to keep track of all the branches we pass that still need testing. We put those cell addresses on a push down stack, which is just a LIFO buffer. We know how deep the stack is in advance: one less than the number of rows, which equals the base two logarithm of the number of cells in the last row. When we finish searching a file, we pop the stack and continue the process until the stack is empty. Then we’re done.
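The walk above can be sketched as a small routine. The tree layout and names are illustrative, not the actual record format, and for brevity a single combined phrase mask stands in for ANDing against each root word’s record separately:

```python
def find_candidate_cells(rows, phrase_mask):
    """rows[0] is the top of the Master File Record tree; every cell is the
    bitwise OR of its two children, and the bottom row holds the per-group
    file bit arrays.  Returns bottom-row cells that survive the AND test."""
    hits = []
    stack = [(0, 0)]                   # (row, cell) branches still to test; LIFO
    while stack:
        row, cell = stack.pop()
        if rows[row][cell] & phrase_mask == 0:
            continue                   # zero result: this branch is eliminated
        if row == len(rows) - 1:
            hits.append(cell)          # bottom row: candidate Library files
        else:
            stack.append((row + 1, 2 * cell))       # left child
            stack.append((row + 1, 2 * cell + 1))   # right child
    return hits

# A three-row toy tree over four file groups; only group one matches.
rows = [[0b1111],
        [0b1100, 0b0011],
        [0b1000, 0b0100, 0b0010, 0b0001]]
print(find_candidate_cells(rows, 0b0100))   # [1]
```

The stack never holds more than one entry per row below the current one, matching the depth bound described above.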

This same binary search mechanism helps determine context, by searching the Associative database files for coincidences, like between the words ‘coconut’ and ‘automobile’, which might occur together in a hierarchy of physical objects.
_________________________
K = 1024 in computer terminology, which is 2^10; and 64K = 2^16, which fits in 16 bits.