Everything in its right place

There's no running away from it. One day or another, you will lose sleep over the death-defying question: how can I retrieve a specific entry in a large dataset? The simplest solution to the problem is a straightforward linear search in which every entry is compared, sequentially, starting from the bottom of the pile, all the way to the top until we find what we need. It looks like this:

DECLARE SUB inList (src AS STRING, lst() AS STRING, lstSize AS INTEGER)

DIM tmpStr AS STRING, lst(31) AS STRING, lstSize AS INTEGER

' setup list
DO
  READ tmpStr
  IF LEN(tmpStr) THEN
    lst(lstSize) = tmpStr
    lstSize = lstSize + 1
  END IF
LOOP UNTIL (LEN(tmpStr) = 0) OR (lstSize > UBOUND(lst))

' do a linear search
inList "Brendon", lst(), lstSize

' list
DATA "Ruth","Karen","Brendon","Dylan","Jeremy","Sophia","Mindy","Joan"
DATA "Jill","Goldie","Brenda","Burt","Michael","Juan","William",""

' linear search
SUB inList (src AS STRING, lst() AS STRING, lstSize AS INTEGER)
  FOR i% = 0 TO lstSize - 1
    IF (lst(i%) = src) THEN
      PRINT src; " is in the table (index"; STR$(i%); ")"
      EXIT SUB
    END IF
  NEXT i%
  PRINT src; " isn't in the table"
END SUB

Linear searches work. In fact, even on slow computers, it's sufficient for small datasets of maybe 10 to 20 entries. However, this method shows its weakness when attempting to parse larger arrays. The core issue is not so much how long it takes to compare two entries, but rather the amount of time we spend comparing stuff we shouldn't be comparing. Can we do better? I mean as a specie. Seriously people. I'm looking and judging right now. Looking and judging in a very condescending manner.

Binary search

Behind the barbaric name hides a clever technique to search words in alphabetically ordered lists by simply halving the length of the list after each comparition! Basically, we keep track of the lower and upper boundaries of the list, then we pick the entry located right in the middle. If our query is supposed to appear before that entry, then we set the upper boundary to the current index minus 1, then pick the entry right in the middle again... if the query is supposed to be after the entry we got, then we set the lower boundary to the current index plus 1, then split in the middle again... and we keep going until there are no more entries left to check or as soon as our query matches the entry. It sounds laborious, but it's really fast! Here's what it looks like:

DECLARE FUNCTION inList% (search AS STRING, lst() AS STRING, lstSize AS INTEGER)

DIM lst(31) AS STRING, lstSize AS INTEGER
DIM tmpStr AS STRING, tmpInt AS INTEGER

' setup list
DO
  READ tmpStr
  IF LEN(tmpStr) THEN
    lst(lstSize) = tmpStr
    lstSize = lstSize + 1
  END IF
LOOP WHILE (LEN(tmpStr) > 0) AND (lstSize <= UBOUND(lst))

' testing for "Ruth"
tmpInt = inList%("Ruth", lst(), lstSize)
IF (tmpInt < 0) THEN
  PRINT "Ruth no found"
ELSE
  PRINT "Ruth found at index"; tmpInt
END IF

' testing for "Goliath"
tmpInt = inList%("Goliath", lst(), lstSize)
IF (tmpInt < 0) THEN
  PRINT "Goliath no found"
ELSE
  PRINT "Goliath found at index"; tmpInt
END IF

' our list (in alphabetical order!)
DATA "Brenda","Brendon","Burt","Dylan","Goldie","Jeremy","Jill","Joan"
DATA "Juan","Karen","Michael","Mindy","Ruth","Sophia","William",""

' binary search (case-insensitive)
FUNCTION inList% (search AS STRING, lst() AS STRING, lstSize AS INTEGER)
  DIM loBound AS INTEGER, hiBound AS INTEGER, middle AS INTEGER
  DIM tmpStr AS STRING

  loBound = 0
  ' lstSize is the length in elements, not the upper boundary!
  hiBound = lstSize - 1
  tmpStr = LCASE$(search)

  DO WHILE (loBound <= hiBound)
    middle = (loBound + hiBound) \ 2
    SELECT CASE LCASE$(lst(middle))
    CASE IS < tmpStr
      loBound = middle + 1
    CASE IS > tmpStr
      hiBound = middle - 1
    CASE ELSE
      inList% = middle
      EXIT FUNCTION
    END SELECT
  LOOP

  inList% = -1
END FUNCTION

Of course, like said previously, this only works if the list is properly sorted beforehand. Adding new elements to a sorted list will require a modified version of the routine, this time returning the best location for the entry to be inserted to: bestSpot would be middle + 1 if tmpStr is greater, and would be middle if tmpStr is lower, keep going until the routine terminates after depleting the whole list... by the way, if you're looking for examples of sorting algorithms, there are few over there.

Hash tables

We usually know where to look for things when we need them. If we want fresh food, we check the fridge. If we want canned food, we search the pantry. Books are usually found in bookshelves, clothes in wardrobes, etc. In short, we categorize and store stuff in such a way we know where to look for them rather than engage in archeological excavations through the whole house, starting in the cellar, all the way up to the attic, opening every drawer in the process.

Now, computers, they don't really understand the beauty of a frozen pizza patiently waiting for you in a freezer, but they do understand numbers. So all we have to do is generate a number (hash key) out of the string "pizza" and use it in place of a freezer (hash table entry;) then we place that freezer somewhere in the house (a large array called hash table.) It's a terrible analogy because we're not actually teaching a computer what frozen food is, but the principle is sound: catalog things in a manner we can determine where it is located without having to search the whole array. In other words, when confronted with a random input, using our cataloguing system we can determine where that item ought to be stored, and thus, if it has (or not) been stored there.

DECLARE FUNCTION getHash%(src AS STRING)

DIM myWord AS STRING

myWord = "pizza"
PRINT "The hash for "; CHR$(34); myWord; CHR$(34); " is"; getHash%(myWord)

' generate a hash from a simple character string
FUNCTION getHash%(src AS STRING)
  DIM sum AS INTEGER

  FOR i% = 1 TO LEN(src)
    sum = sum + ASC(MID$(src, i%, 1))
  NEXT i%
  getHash% = sum
END FUNCTION

Two things about that function: first, the longer the word is, the larger the hash key will be. This means that eventually we'll crash head-first into an overflow... besides, we have to use that hash key as an index pointing to a slot into an array, so we'll have to cap the hash key to make sure it never goes out of boundaries. Second, this formula tends to generate collisions. A collision is when different strings produce the same hash key. Keep in mind that regardless of how good the hashing function is and how large the array is, this technique will produce collisions. To solve the issue, we have to allow multiple items to be stored at the same location. Let's get to it.

DECLARE FUNCTION getHash%(src AS STRING)

' size of the hash table
CONST hashTableSize = 80

' list of unique words
DIM wordTable(999) AS STRING, numWords AS INTEGER

' hash table, where hashes and words are associated
DIM hashTable(hashTableSize - 1) AS STRING

' generic variable stuff
DIM myWord AS STRING, myHashKey AS INTEGER

' read words from DATA statements and insert into the word table
DO
  READ myWord
  IF (LEN(myWord) = 0) THEN EXIT DO
  wordTable(numWords) = myWord
  numWords = numWords + 1
LOOP UNTIL (numWords > UBOUND(wordTable))

' generate hash keys for each word
FOR i% = 0 TO numWords - 1
  myHashKey = getHash%(wordTable(i%)) ' hash key for this word
  hashTable(myHashKey) = hashTable(myHashKey) + MKI$(i%) ' another word with that hash...
NEXT i%

' list of random stuff
DATA "pizza","yogurt","pasta","chicken","soup","shoes","books","shirts",""

' generate a hash from a simple character string
FUNCTION getHash%(src AS STRING)
  DIM sum AS INTEGER

  FOR i% = 1 TO LEN(src)
    sum = (sum + (((ASC(MID$(src, i%, 1)) OR 32) - 97) * (i% * 26))) MOD hashTableSize
  NEXT i%
  getHash% = sum
END FUNCTION

I think most of the code is self-explanatory, except maybe the changes made to the getHash% function. In short, we force upper-case letters to lower-case (OR 32,) subtract 97 so (a = 0 and z = 25,) then we multiply the value of the letter by its position multiplied by 26 (since there are 26 letters in the alphabet -- it's similar to how the leftmost "1" in "eleven" is worth 10, while the rightmost "1" in "eleven" is only worth 1.) Finally, we use the modulo operator to keep the value in check by returning the remainder of a whole division. It'll work properly as long as hashTableSize doesn't share a common divisor with 26.

Okay, now what? Well, now we can pool our data and see if a word already exists in a list. For the following example, the hash table has been reduced to only 17 entries while the list has been extended to 32 words. Ideally, we should give the array a prime number of slots to increase the odds of an evenly repartition of entries. As a reminder, a prime number is a number that only has two strictly unique divisors: one and itself.

DECLARE FUNCTION getHash%(src AS STRING)

' size of the hash table
CONST hashTableSize = 17

' list of unique words
DIM wordTable(999) AS STRING, numWords AS INTEGER

' hash table, where hashes and words are associated
DIM hashTable(hashTableSize - 1) AS STRING

' generic variable stuff
DIM myWord AS STRING, myHashKey AS INTEGER, wordIds AS STRING
DIM wordFound AS INTEGER

' generate word list and hash table
DO
  READ myWord
  IF (LEN(myWord) = 0) THEN EXIT DO
  ' append word to list
  wordTable(numWords) = myWord
  ' generate hash key for this word
  myHashKey = getHash%(myWord)
  ' add word index from the wordTable() into the hash table
  hashTable(myHashKey) = hashTable(myHashKey) + MKI$(numWords)
  ' and that's a wrap...
  numWords = numWords + 1
LOOP UNTIL (numWords > UBOUND(wordTable))

' display stuff on screen
CLS
PRINT "wordTable()"
FOR i% = 0 TO numWords - 1
  PRINT RIGHT$("00" + LTRIM$(STR$(i%)), 3); " "; wordTable(i%); SPACE$(16 - LEN(wordTable(i%)));
NEXT i%: PRINT : PRINT
PRINT "hashTable()"
FOR i% = 0 TO hashTableSize - 1
  PRINT RIGHT$("00" + LTRIM$(STR$(i%)), 3); " ";
  wordIds = hashTable(i%)
  FOR j% = 1 TO LEN(wordIds) STEP 2
    PRINT CVI(MID$(wordIds, j%, 2));
  NEXT j%
  IF (POS(0) < 40) THEN
    PRINT SPACE$(41 - POS(0));
  ELSE
    PRINT
  END IF
NEXT i%: PRINT : PRINT

' prompt user
DO
  LOCATE 22, 1: PRINT SPACE$(160);
  LOCATE 22, 1: INPUT "Word: ", myWord
  IF (LEN(myWord) = 0) THEN EXIT DO
  wordFound = -1

  ' generate hash key for the word
  myHashKey = getHash%(myWord)

  ' see if there's something in the hash table
  wordIds = hashTable(myHashKey)
  ' search all wordTable indices registered under that hash
  FOR i% = 1 TO LEN(wordIds) STEP 2
    IF LCASE$(wordTable(CVI(MID$(wordIds, i%, 2)))) = LCASE$(myWord) THEN
      wordFound = CVI(MID$(wordIds, i%, 2))
      EXIT FOR
    END IF
  NEXT i%

  ' summary
  PRINT CHR$(34); myWord; CHR$(34); " ";
  IF (wordFound = -1) THEN
    PRINT "not found"
  ELSE
    PRINT "found: wordTable("; LTRIM$(STR$(wordFound)); "), via ";
    PRINT "hashTable("; LTRIM$(STR$(myHashKey)); ")"
  END IF
  DO: LOOP UNTIL LEN(INKEY$)
LOOP

' list of random stuff
DATA "pizza","yogurt","pasta","chicken","soup","shoes","books","shirts"
DATA "knife","computer","clock","phone","light","water","soda","hat"
DATA "wallet","bucket","bin","calendar","broom","saw","scissors","guitar"
DATA "raft","tree","bottle","bed","tv","rope","door","stairs",""

' generate a hash from a simple character string
FUNCTION getHash% (src AS STRING)
  DIM sum AS INTEGER

  FOR i% = 1 TO LEN(src)
    sum = (sum + (((ASC(MID$(src, i%, 1)) OR 32) - 97) * (i% * 26))) MOD hashTableSize
  NEXT i%
  getHash% = sum
END FUNCTION

It's very fast because we essentially break down a large list into smaller lists we can search linearly, and the hash key tells us which list we have to search in order to retrieve what we want. This technique can search lists of over 1,000 items in no time, even on a 8Mhz! You can test it out by writing your own code using a list of common english words (there's a much larger list over here.) Just remember that you need a good algorithm that returns evenly distributed hash keys, and a hash table that is large enough to reduce the odds of multiple entries being located at the same spot.

If you were to work with very large datasets straight from files, the disk device will be a bottleneck. Try to combine both a hash table and a binary search: in case of collision, insert new ids into the list by alphabetical order, so a binary search could be used in place of a linear search. When looking for variable-length structures in a file, create an extra table that would point to the offset of each entry... the possibilities are endless!