There's no running away from it. One day or another, you will lose sleep over the death-defying question: how can I retrieve a specific entry in a large dataset? The simplest solution to the problem is a straightforward linear search in which every entry is compared, sequentially, starting from the bottom of the pile, all the way to the top until we find what we need. It looks like this:
DECLARE FUNCTION inList% (src AS STRING, lst() AS STRING, lstSize AS INTEGER)
DIM lst(31) AS STRING, lstSize AS INTEGER
DIM tmpStr AS STRING, iIndex AS INTEGER
' Build list from data statements.
DO
READ tmpStr
IF LEN(tmpStr) THEN
lst(lstSize) = tmpStr
lstSize = lstSize + 1
END IF
LOOP UNTIL (LEN(tmpStr) = 0) OR (lstSize > UBOUND(lst))
' Is Brendon in the list?
tmpStr = "Brendon"
iIndex = inList%(tmpStr, lst(), lstSize)
IF (iIndex = -1) THEN
PRINT tmpStr; " no found"
ELSE
PRINT tmpStr; " found at index"; iIndex
END IF
DATA "Ruth","Karen","Brendon","Dylan","Jeremy","Sophia","Mindy","Joan"
DATA "Jill","Goldie","Brenda","Burt","Michael","Juan","William",""
' Perform a linear search and return the index of the requested entry, or -1
' if there's no match.
FUNCTION inList% (src AS STRING, lst() AS STRING, lstSize AS INTEGER)
' Assume the entry doesn't exist.
inList% = -1
' Go through the list, one element at a time.
FOR i% = 0 TO lstSize - 1
' Do we have a match?
IF (lst(i%) = src) THEN
' Yes we do! Remember the index and stop searching.
inList% = i%
EXIT FOR
END IF
NEXT i%
END FUNCTION
On very old computers, linear searches are sufficient for small datasets of maybe 10 to 20 entries. However, this method shows its weakness when attempting to parse larger arrays. The core issue is not so much how long it takes to compare two entries, but rather the amount of time we spend comparing stuff we shouldn't be comparing. Can we do better? I mean as a specie. Seriously people. I'm looking and judging right now. Looking and judging in a very condescending manner.
For those who don't remember ye olden days, we used to have a thing made of "paper" that we called "dictionary." The "book" contained the "definition" or "meaning" of words, and each entry was sorted alphabetically, meaning that words beginning with the letter "A" would appear before words beginning with the letter "Z." Sorry for bombarding you with regressive beliefs that have no place in our enlightned society, but the core concept of alphabetically sorted lists birthed a really neat search algorithm: the Binary Search.
Let's say we have a list of 1000 alphabetically-sorted names or words (it also works with numbers) and we're looking for a very specific entry. Say "Velma." Instead of starting at entry 1 and work our way to entry 1000, we're going to split the list in half and look right in the middle, at entry 500. The entry might be "Lois." Since the list is ordered, if Velma does exist, then it should appear after Lois. It means that we don't have to check entries 1-500 anymore and we only have 500 entries left (501 to 1000.) We split right in the middle at index 750 and check again: the entry is "Steven." We know that Velma should appear after Steven, so, following the same reasoning, we can skip entries 501 to 750 and focus our efforts on the remaining 250. That's 75% of the list discarded and we're only two iterations deep!
The aglorithm is super easy to implement: first, we set two markers (for the lower and upper bounds of the list,) then we determine the index of the entry located right between the two (the sum of the lower and upper boundaries divided by two.) If the query should appear before the mid-entry, we set the upper bound to the mid-entry index minus 1. If the query should appear after the mid-entry, we set the lower bound to the mid-entry index plus 1. Rince and repeat until we find a match, or until the whole list has been processed. Here's what it looks like:
DECLARE FUNCTION inList% (search AS STRING, lst() AS STRING, lstSize AS INTEGER)
DIM lst(31) AS STRING, lstSize AS INTEGER
DIM tmpStr AS STRING, iIndex AS INTEGER
' Build list from data statements.
DO
READ tmpStr
IF LEN(tmpStr) THEN
lst(lstSize) = tmpStr
lstSize = lstSize + 1
END IF
LOOP UNTIL (LEN(tmpStr) = 0) OR (lstSize > UBOUND(lst))
' Is Ruth in the list?
tmpStr = "Ruth"
iIndex = inList%(tmpStr, lst(), lstSize)
IF (iIndex = -1) THEN
PRINT tmpStr; " no found"
ELSE
PRINT tmpStr; " found at index"; iIndex
END IF
' Is Goliath in the list?
tmpStr = "Goliath"
iIndex = inList%(tmpStr, lst(), lstSize)
IF (iIndex = -1) THEN
PRINT tmpStr; " no found"
ELSE
PRINT tmpStr; " found at index"; iIndex
END IF
' List data (in alphabetical order!)
DATA "Brenda","Brendon","Burt","Dylan","Goldie","Jeremy","Jill","Joan"
DATA "Juan","Karen","Michael","Mindy","Ruth","Sophia","William",""
' Perform a binary search and return the index of the requested entry, or -1
' if there's no match.
FUNCTION inList% (search AS STRING, lst() AS STRING, lstSize AS INTEGER)
DIM iFloor AS INTEGER, iCeiling AS INTEGER, iMiddle AS INTEGER
' Assume the entry doesn't exist.
inList% = -1
' Set the lower and upper boundaries of the local list. Please note that
' lstSize is the length of the array in entries, not the upper boundary!
iFloor = 0
iCeiling = lstSize - 1
' Loop while the local list is at least 1 entry.
DO WHILE (iFloor <= iCeiling)
' Get the index located right in the middle of the local list.
iMiddle = (iFloor + iCeiling) \ 2
' Compare entry with the query.
SELECT CASE lst(iMiddle)
CASE IS < search
iFloor = iMiddle + 1
CASE IS > search
iCeiling = iMiddle - 1
CASE ELSE
' We got it! Neat.
inList% = iMiddle
EXIT DO
END SELECT
LOOP
END FUNCTION
Of course this only works if the list is properly sorted beforehand. If the list is static, it may be sorted once at runtime (you can find a few sorting algorithms here.) If new entries may be inserted later, the function above can return the ideal (ordered) insertion spot: bestSpot would be iMiddle + 1 if tmpStr is greater, and would be iMiddle if tmpStr is lower. However, sometimes, the array has a very specific order that must be maintained at all cost but unfortunately doesn't align with our searching needs, in which case we have to work with a second array:
DECLARE FUNCTION inList% (search AS STRING, lst() AS STRING, lstAlpha() AS INTEGER, lstSize AS INTEGER)
DIM lst(31) AS STRING, lstAlpha(31) AS INTEGER, lstSize AS INTEGER
DIM tmpStr AS STRING, iIndex AS INTEGER, doAgain AS INTEGER, max AS INTEGER
' Build list from data statements.
DO
READ tmpStr
IF LEN(tmpStr) THEN
lst(lstSize) = tmpStr
lstAlpha(lstSize) = lstSize ' Preserve index while we're here.
lstSize = lstSize + 1
END IF
LOOP UNTIL (LEN(tmpStr) = 0) OR (lstSize > UBOUND(lst))
' Bubble sort lst() alphabetically on lstAlpha()
max = lstSize - 1
DO
doAgain = 0
FOR i% = 0 TO max - 1
IF (lst(lstAlpha(i%)) > lst(lstAlpha(i% + 1))) THEN
SWAP lstAlpha(i%), lstAlpha(i% + 1)
doAgain = -1
END IF
NEXT i%
max = max - 1
LOOP WHILE doAgain
' Is Burt in the list?
tmpStr = "Burt"
iIndex = inList%(tmpStr, lst(), lstAlpha(), lstSize)
IF (iIndex = -1) THEN
PRINT tmpStr; " no found"
ELSE
PRINT tmpStr; " found at index"; iIndex
END IF
' Is Rudolph in the list?
tmpStr = "Rudolph"
iIndex = inList%(tmpStr, lst(), lstAlpha(), lstSize)
IF (iIndex = -1) THEN
PRINT tmpStr; " no found"
ELSE
PRINT tmpStr; " found at index"; iIndex
END IF
' List data (any order)
DATA "Ruth","Karen","Brendon","Dylan","Jeremy","Sophia","Mindy","Joan"
DATA "Jill","Goldie","Brenda","Burt","Michael","Juan","William",""
' Perform a binary search and return the index of the requested entry, or -1
' if there's no match. lstAlpha() contains the index of each name in lst(),
' sorted alphabetically.
FUNCTION inList% (search AS STRING, lst() AS STRING, lstAlpha() AS INTEGER, lstSize AS INTEGER)
DIM iFloor AS INTEGER, iCeiling AS INTEGER, iMiddle AS INTEGER
' Assume the entry doesn't exist.
inList% = -1
' Set the lower and upper boundaries of the local list. Please note that
' lstSize is the length of the array in entries, not the upper boundary!
iFloor = 0
iCeiling = lstSize - 1
' Loop while the local list is at least 1 entry.
DO WHILE (iFloor <= iCeiling)
' Get the index located right in the middle of the local list.
iMiddle = (iFloor + iCeiling) \ 2
' Compare entry with the query.
SELECT CASE lst(lstAlpha(iMiddle))
CASE IS < search
iFloor = iMiddle + 1
CASE IS > search
iCeiling = iMiddle - 1
CASE ELSE
' That's the one! Kind reminder that iMiddle is the
' index in lstAlpha(), not in lst().
inList% = lstAlpha(iMiddle)
EXIT DO
END SELECT
LOOP
END FUNCTION
This one is really simple too! Essentially, we convert a string into a numeric value that matches a sub-list, which should contain the string. For instance, we're going to dispatch strings in different lists according to their first character. We're going to build 27 lists (26 for words starting by any letter from A to Z, and an extra list for words that start with anything else, like numbers or special characters.) Then, we write a "hash" function as follows:
DECLARE FUNCTION getHash%(src AS STRING)
PRINT "Hula hoop goes to list "; getHash%("Hula hoop")
FUNCTION getHash%(src AS STRING)
DIM first AS INTEGER
' Take the first character of the string, make it
' uppercase and get its ASCII value.
first = ASC(UCASE$(LEFT$(src, 1)))
' Cap value
IF ((first >= 65) AND (first <= 90)) THEN
getHash% = first - 65 ' 0 to 25
ELSE
getHash% = 26
END IF
END FUNCTION
Okay, but what's the point of all that? Usually, we know where to look for things when we need them. If we want fresh food, we check the fridge. If we want canned food, we search the pantry. Books are usually found in bookshelves, clothes in wardrobes, etc. In short, we put away things in such a way we know where to look for them rather than engage in archeological excavations through the whole house, starting in the cellar, all the way up to the attic, opening every drawer in the process. We got a fork? Put it in the kitchen's drawer. We got a t-shirt? Put it in the wardrobe. And when we need that t-shirt? Well, we know were we would put it, thus it has to be there.
The thing is that we don't actually need a "logical" place to put our strings, we just a need a reliable way to find a place to store them. By reliable, we mean a formula that will always provide the exact same result when we provide the same string. The previous hash function we wrote works, but it's a little lackluster because we want our strings to be evenly distributed among multiple arrays if possible. To do that, we may increment the number of characters we process to compute the hash:
DECLARE FUNCTION getHash%(src AS STRING)
DIM myWord AS STRING
myWord = "pizza"
PRINT "The hash for "; CHR$(34); myWord; CHR$(34); " is"; getHash%(myWord)
' generate a hash from a simple character string
FUNCTION getHash%(src AS STRING)
DIM sum AS INTEGER
FOR i% = 1 TO LEN(src)
sum = sum + ASC(MID$(src, i%, 1))
NEXT i%
getHash% = sum
END FUNCTION
Two things about that function: first, the longer the word is, the larger the hash key will be. This means that eventually we'll crash head-first into an overflow... besides, we have to use that hash key as an index pointing to a slot into an array, so we'll have to cap the hash key to make sure it never goes out of boundaries. Second, this formula tends to generate collisions. A collision is when different strings produce the same hash key. Keep in mind that regardless of how good the hashing function is and how large the array is, this technique will produce collisions. To solve the issue, we have to allow multiple items to be stored at the same location. Let's get to it.
DECLARE FUNCTION getHash%(src AS STRING)
' size of the hash table
CONST hashTableSize = 80
' list of unique words
DIM wordTable(999) AS STRING, numWords AS INTEGER
' hash table, where hashes and words are associated
DIM hashTable(hashTableSize - 1) AS STRING
' generic variable stuff
DIM myWord AS STRING, myHashKey AS INTEGER
' read words from DATA statements and insert into the word table
DO
READ myWord
IF (LEN(myWord) = 0) THEN EXIT DO
wordTable(numWords) = myWord
numWords = numWords + 1
LOOP UNTIL (numWords > UBOUND(wordTable))
' generate hash keys for each word
FOR i% = 0 TO numWords - 1
myHashKey = getHash%(wordTable(i%)) ' hash key for this word
hashTable(myHashKey) = hashTable(myHashKey) + MKI$(i%) ' another word with that hash...
NEXT i%
' list of random stuff
DATA "pizza","yogurt","pasta","chicken","soup","shoes","books","shirts",""
' generate a hash from a simple character string
FUNCTION getHash%(src AS STRING)
DIM sum AS INTEGER
FOR i% = 1 TO LEN(src)
sum = (sum + (((ASC(MID$(src, i%, 1)) OR 32) - 97) * (i% * 26))) MOD hashTableSize
NEXT i%
getHash% = sum
END FUNCTION
First, we load all words into an array. The array is unsorted and we can't really find its content easily. Then, for each word, we generate its hash key. We use the hash key to access one of the many hashTable() strings, where we'll append the index (in wordTable()) of the word. From there, we can test for the existence of a string by computing its hash, then look into the matching hashTable() list for all the indices. We go through each index, one by one, until one of them match the string we were looking for. If there's no match, then the string is not yet registered.
For the following example, the hash table has been reduced to only 17 entries while the list has been extended to 32 words. Ideally, we should give the array a prime number of slots to increase the odds of an evenly repartition of entries. As a reminder, a prime number is a number that has two divisors: one and itself (meaning that 1 is not a prime number.)
DECLARE FUNCTION getHash%(src AS STRING)
' size of the hash table
CONST hashTableSize = 17
' list of unique words
DIM wordTable(999) AS STRING, numWords AS INTEGER
' hash table, where hashes and words are associated
DIM hashTable(hashTableSize - 1) AS STRING
' generic variable stuff
DIM myWord AS STRING, myHashKey AS INTEGER, wordIds AS STRING
DIM wordFound AS INTEGER
' generate word list and hash table
DO
READ myWord
IF (LEN(myWord) = 0) THEN EXIT DO
' append word to list
wordTable(numWords) = myWord
' generate hash key for this word
myHashKey = getHash%(myWord)
' add word index from the wordTable() into the hash table
hashTable(myHashKey) = hashTable(myHashKey) + MKI$(numWords)
' and that's a wrap...
numWords = numWords + 1
LOOP UNTIL (numWords > UBOUND(wordTable))
' display stuff on screen
CLS
PRINT "wordTable()"
FOR i% = 0 TO numWords - 1
PRINT RIGHT$("00" + LTRIM$(STR$(i%)), 3); " "; wordTable(i%); SPACE$(16 - LEN(wordTable(i%)));
NEXT i%: PRINT: PRINT
PRINT "hashTable()"
FOR i% = 0 TO hashTableSize - 1
PRINT RIGHT$("00" + LTRIM$(STR$(i%)), 3); " ";
wordIds = hashTable(i%)
FOR j% = 1 TO LEN(wordIds) STEP 2
PRINT CVI(MID$(wordIds, j%, 2));
NEXT j%
IF (POS(0) < 40) THEN
PRINT SPACE$(41 - POS(0));
ELSE
PRINT
END IF
NEXT i%: PRINT: PRINT
' prompt user
DO
LOCATE 22, 1: PRINT SPACE$(160);
LOCATE 22, 1: INPUT "Word: ", myWord
IF (LEN(myWord) = 0) THEN EXIT DO
wordFound = -1
' generate hash key for the word
myHashKey = getHash%(myWord)
' see if there's something in the hash table
wordIds = hashTable(myHashKey)
' search all wordTable indices registered under that hash
FOR i% = 1 TO LEN(wordIds) STEP 2
IF LCASE$(wordTable(CVI(MID$(wordIds, i%, 2)))) = LCASE$(myWord) THEN
wordFound = CVI(MID$(wordIds, i%, 2))
EXIT FOR
END IF
NEXT i%
' summary
PRINT CHR$(34); myWord; CHR$(34); " ";
IF (wordFound = -1) THEN
PRINT "not found"
ELSE
PRINT "found: wordTable("; LTRIM$(STR$(wordFound)); "), via ";
PRINT "hashTable("; LTRIM$(STR$(myHashKey)); ")"
END IF
DO: LOOP UNTIL LEN(INKEY$)
LOOP
' list of random stuff
DATA "pizza","yogurt","pasta","chicken","soup","shoes","books","shirts"
DATA "knife","computer","clock","phone","light","water","soda","hat"
DATA "wallet","bucket","bin","calendar","broom","saw","scissors","guitar"
DATA "raft","tree","bottle","bed","tv","rope","door","stairs",""
' generate a hash from a simple character string
FUNCTION getHash% (src AS STRING)
DIM sum AS INTEGER
FOR i% = 1 TO LEN(src)
sum = (sum + (((ASC(MID$(src, i%, 1)) OR 32) - 97) * (i% * 26))) MOD hashTableSize
NEXT i%
getHash% = sum
END FUNCTION
It's very fast because we essentially break down a large list into smaller lists we can search linearly, and the hash key tells us which list we have to search in order to retrieve what we want. This technique can search lists of over 1,000 items in no time, even on a 8Mhz! You can test it out by writing your own code using a list of common english words (there's a much larger list over here.) Just remember that you need a good algorithm that returns evenly distributed hash keys, and a hash table that is large enough to reduce the odds of multiple entries being located at the same spot.
If you were to work with very large datasets straight from files, the disk device will be a bottleneck. Try to combine both a hash table and a binary search: in case of collision, insert new ids into the list by alphabetical order, so a binary search could be used in place of a linear search. When looking for variable-length structures in a file, create an extra table that would point to the offset of each entry... the possibilities are endless!