There's no running away from it. One day or another, you will lose sleep over the death-defying question: how can I retrieve a specific entry in a large dataset? The simplest solution to the problem is a straightforward linear search in which every entry is compared, sequentially, starting from the bottom of the pile, all the way to the top until we find what we need. It looks like this:
DECLARE SUB inList (src AS STRING, lst() AS STRING, lstSize AS INTEGER)
DIM tmpStr AS STRING, lst(31) AS STRING, lstSize AS INTEGER
' setup list
DO
READ tmpStr
IF LEN(tmpStr) THEN
lst(lstSize) = tmpStr
lstSize = lstSize + 1
END IF
LOOP UNTIL (LEN(tmpStr) = 0) OR (lstSize > UBOUND(lst))
' do a linear search
inList "Brendon", lst(), lstSize
' list
DATA "Ruth","Karen","Brendon","Dylan","Jeremy","Sophia","Mindy","Joan"
DATA "Jill","Goldie","Brenda","Burt","Michael","Juan","William",""
' linear search
SUB inList (src AS STRING, lst() AS STRING, lstSize AS INTEGER)
FOR i% = 0 TO lstSize - 1
IF (lst(i%) = src) THEN
PRINT src; " is in the table (index"; STR$(i%); ")"
EXIT SUB
END IF
NEXT i%
PRINT src; " isn't in the table"
END SUB
Linear searches work. In fact, even on slow computers, it's sufficient for small datasets of maybe 10 to 20 entries. However, this method shows its weakness when attempting to parse larger arrays. The core issue is not so much how long it takes to compare two entries, but rather the amount of time we spend comparing stuff we shouldn't be comparing. Can we do better? I mean as a specie. Seriously people. I'm looking and judging right now. Looking and judging in a very condescending manner.
Behind the barbaric name hides a clever technique to search words in alphabetically ordered lists by simply halving the length of the list after each comparition! Basically, we keep track of the lower and upper boundaries of the list, then we pick the entry located right in the middle. If our query is supposed to appear before that entry, then we set the upper boundary to the current index minus 1, then pick the entry right in the middle again... if the query is supposed to be after the entry we got, then we set the lower boundary to the current index plus 1, then split in the middle again... and we keep going until there are no more entries left to check or as soon as our query matches the entry. It sounds laborious, but it's really fast! Here's what it looks like:
DECLARE FUNCTION inList% (search AS STRING, lst() AS STRING, lstSize AS INTEGER)
DIM lst(31) AS STRING, lstSize AS INTEGER
DIM tmpStr AS STRING, tmpInt AS INTEGER
' setup list
DO
READ tmpStr
IF LEN(tmpStr) THEN
lst(lstSize) = tmpStr
lstSize = lstSize + 1
END IF
LOOP WHILE (LEN(tmpStr) > 0) AND (lstSize <= UBOUND(lst))
' testing for "Ruth"
tmpInt = inList%("Ruth", lst(), lstSize)
IF (tmpInt < 0) THEN
PRINT "Ruth no found"
ELSE
PRINT "Ruth found at index"; tmpInt
END IF
' testing for "Goliath"
tmpInt = inList%("Goliath", lst(), lstSize)
IF (tmpInt < 0) THEN
PRINT "Goliath no found"
ELSE
PRINT "Goliath found at index"; tmpInt
END IF
' our list (in alphabetical order!)
DATA "Brenda","Brendon","Burt","Dylan","Goldie","Jeremy","Jill","Joan"
DATA "Juan","Karen","Michael","Mindy","Ruth","Sophia","William",""
' binary search (case-insensitive)
FUNCTION inList% (search AS STRING, lst() AS STRING, lstSize AS INTEGER)
DIM loBound AS INTEGER, hiBound AS INTEGER, middle AS INTEGER
DIM tmpStr AS STRING
loBound = 0
' lstSize is the length in elements, not the upper boundary!
hiBound = lstSize - 1
tmpStr = LCASE$(search)
DO WHILE (loBound <= hiBound)
middle = (loBound + hiBound) \ 2
SELECT CASE LCASE$(lst(middle))
CASE IS < tmpStr
loBound = middle + 1
CASE IS > tmpStr
hiBound = middle - 1
CASE ELSE
inList% = middle
EXIT FUNCTION
END SELECT
LOOP
inList% = -1
END FUNCTION
Of course, like said previously, this only works if the list is properly sorted beforehand. Adding new elements to a sorted list will require a modified version of the routine, this time returning the best location for the entry to be inserted to: bestSpot would be middle + 1 if tmpStr is greater, and would be middle if tmpStr is lower, keep going until the routine terminates after depleting the whole list... by the way, if you're looking for examples of sorting algorithms, there are few over there.
We usually know where to look for things when we need them. If we want fresh food, we check the fridge. If we want canned food, we search the pantry. Books are usually found in bookshelves, clothes in wardrobes, etc. In short, we categorize and store stuff in such a way we know where to look for them rather than engage in archeological excavations through the whole house, starting in the cellar, all the way up to the attic, opening every drawer in the process.
Now, computers, they don't really understand the beauty of a frozen pizza patiently waiting for you in a freezer, but they do understand numbers. So all we have to do is generate a number (hash key) out of the string "pizza" and use it in place of a freezer (hash table entry;) then we place that freezer somewhere in the house (a large array called hash table.) It's a terrible analogy because we're not actually teaching a computer what frozen food is, but the principle is sound: catalog things in a manner we can determine where it is located without having to search the whole array. In other words, when confronted with a random input, using our cataloguing system we can determine where that item ought to be stored, and thus, if it has (or not) been stored there.
DECLARE FUNCTION getHash%(src AS STRING)
DIM myWord AS STRING
myWord = "pizza"
PRINT "The hash for "; CHR$(34); myWord; CHR$(34); " is"; getHash%(myWord)
' generate a hash from a simple character string
FUNCTION getHash%(src AS STRING)
DIM sum AS INTEGER
FOR i% = 1 TO LEN(src)
sum = sum + ASC(MID$(src, i%, 1))
NEXT i%
getHash% = sum
END FUNCTION
Two things about that function: first, the longer the word is, the larger the hash key will be. This means that eventually we'll crash head-first into an overflow... besides, we have to use that hash key as an index pointing to a slot into an array, so we'll have to cap the hash key to make sure it never goes out of boundaries. Second, this formula tends to generate collisions. A collision is when different strings produce the same hash key. Keep in mind that regardless of how good the hashing function is and how large the array is, this technique will produce collisions. To solve the issue, we have to allow multiple items to be stored at the same location. Let's get to it.
DECLARE FUNCTION getHash%(src AS STRING)
' size of the hash table
CONST hashTableSize = 80
' list of unique words
DIM wordTable(999) AS STRING, numWords AS INTEGER
' hash table, where hashes and words are associated
DIM hashTable(hashTableSize - 1) AS STRING
' generic variable stuff
DIM myWord AS STRING, myHashKey AS INTEGER
' read words from DATA statements and insert into the word table
DO
READ myWord
IF (LEN(myWord) = 0) THEN EXIT DO
wordTable(numWords) = myWord
numWords = numWords + 1
LOOP UNTIL (numWords > UBOUND(wordTable))
' generate hash keys for each word
FOR i% = 0 TO numWords - 1
myHashKey = getHash%(wordTable(i%)) ' hash key for this word
hashTable(myHashKey) = hashTable(myHashKey) + MKI$(i%) ' another word with that hash...
NEXT i%
' list of random stuff
DATA "pizza","yogurt","pasta","chicken","soup","shoes","books","shirts",""
' generate a hash from a simple character string
FUNCTION getHash%(src AS STRING)
DIM sum AS INTEGER
FOR i% = 1 TO LEN(src)
sum = (sum + (((ASC(MID$(src, i%, 1)) OR 32) - 97) * (i% * 26))) MOD hashTableSize
NEXT i%
getHash% = sum
END FUNCTION
I think most of the code is self-explanatory, except maybe the changes made to the getHash% function. In short, we force upper-case letters to lower-case (OR 32,) subtract 97 so (a = 0 and z = 25,) then we multiply the value of the letter by its position multiplied by 26 (since there are 26 letters in the alphabet -- it's similar to how the leftmost "1" in "eleven" is worth 10, while the rightmost "1" in "eleven" is only worth 1.) Finally, we use the modulo operator to keep the value in check by returning the remainder of a whole division. It'll work properly as long as hashTableSize doesn't share a common divisor with 26.
Okay, now what? Well, now we can pool our data and see if a word already exists in a list. For the following example, the hash table has been reduced to only 17 entries while the list has been extended to 32 words. Ideally, we should give the array a prime number of slots to increase the odds of an evenly repartition of entries. As a reminder, a prime number is a number that only has two strictly unique divisors: one and itself.
DECLARE FUNCTION getHash%(src AS STRING)
' size of the hash table
CONST hashTableSize = 17
' list of unique words
DIM wordTable(999) AS STRING, numWords AS INTEGER
' hash table, where hashes and words are associated
DIM hashTable(hashTableSize - 1) AS STRING
' generic variable stuff
DIM myWord AS STRING, myHashKey AS INTEGER, wordIds AS STRING
DIM wordFound AS INTEGER
' generate word list and hash table
DO
READ myWord
IF (LEN(myWord) = 0) THEN EXIT DO
' append word to list
wordTable(numWords) = myWord
' generate hash key for this word
myHashKey = getHash%(myWord)
' add word index from the wordTable() into the hash table
hashTable(myHashKey) = hashTable(myHashKey) + MKI$(numWords)
' and that's a wrap...
numWords = numWords + 1
LOOP UNTIL (numWords > UBOUND(wordTable))
' display stuff on screen
CLS
PRINT "wordTable()"
FOR i% = 0 TO numWords - 1
PRINT RIGHT$("00" + LTRIM$(STR$(i%)), 3); " "; wordTable(i%); SPACE$(16 - LEN(wordTable(i%)));
NEXT i%: PRINT : PRINT
PRINT "hashTable()"
FOR i% = 0 TO hashTableSize - 1
PRINT RIGHT$("00" + LTRIM$(STR$(i%)), 3); " ";
wordIds = hashTable(i%)
FOR j% = 1 TO LEN(wordIds) STEP 2
PRINT CVI(MID$(wordIds, j%, 2));
NEXT j%
IF (POS(0) < 40) THEN
PRINT SPACE$(41 - POS(0));
ELSE
PRINT
END IF
NEXT i%: PRINT : PRINT
' prompt user
DO
LOCATE 22, 1: PRINT SPACE$(160);
LOCATE 22, 1: INPUT "Word: ", myWord
IF (LEN(myWord) = 0) THEN EXIT DO
wordFound = -1
' generate hash key for the word
myHashKey = getHash%(myWord)
' see if there's something in the hash table
wordIds = hashTable(myHashKey)
' search all wordTable indices registered under that hash
FOR i% = 1 TO LEN(wordIds) STEP 2
IF LCASE$(wordTable(CVI(MID$(wordIds, i%, 2)))) = LCASE$(myWord) THEN
wordFound = CVI(MID$(wordIds, i%, 2))
EXIT FOR
END IF
NEXT i%
' summary
PRINT CHR$(34); myWord; CHR$(34); " ";
IF (wordFound = -1) THEN
PRINT "not found"
ELSE
PRINT "found: wordTable("; LTRIM$(STR$(wordFound)); "), via ";
PRINT "hashTable("; LTRIM$(STR$(myHashKey)); ")"
END IF
DO: LOOP UNTIL LEN(INKEY$)
LOOP
' list of random stuff
DATA "pizza","yogurt","pasta","chicken","soup","shoes","books","shirts"
DATA "knife","computer","clock","phone","light","water","soda","hat"
DATA "wallet","bucket","bin","calendar","broom","saw","scissors","guitar"
DATA "raft","tree","bottle","bed","tv","rope","door","stairs",""
' generate a hash from a simple character string
FUNCTION getHash% (src AS STRING)
DIM sum AS INTEGER
FOR i% = 1 TO LEN(src)
sum = (sum + (((ASC(MID$(src, i%, 1)) OR 32) - 97) * (i% * 26))) MOD hashTableSize
NEXT i%
getHash% = sum
END FUNCTION
It's very fast because we essentially break down a large list into smaller lists we can search linearly, and the hash key tells us which list we have to search in order to retrieve what we want. This technique can search lists of over 1,000 items in no time, even on a 8Mhz! You can test it out by writing your own code using a list of common english words (there's a much larger list over here.) Just remember that you need a good algorithm that returns evenly distributed hash keys, and a hash table that is large enough to reduce the odds of multiple entries being located at the same spot.
If you were to work with very large datasets straight from files, the disk device will be a bottleneck. Try to combine both a hash table and a binary search: in case of collision, insert new ids into the list by alphabetical order, so a binary search could be used in place of a linear search. When looking for variable-length structures in a file, create an extra table that would point to the offset of each entry... the possibilities are endless!