Your friendly neighbor

Let's do something simple this time. Something... BASIC shall we say. Let's talk about INSTR!

INSTR is a function that searches for a sub-string inside a longer string and returns the location of the first byte of the first match, or 0 if there is no match. It exists in two flavors. The first flavor only takes two parameters: the string that may (or may not) contain the sub-string, and the sub-string to match.

' Returns the position of the first "o" found in "Hello World",
' starting from the very beginning of the string. Returns 5.
PRINT INSTR("Hello World", "o")

' The function returns 0, because "Hello World" doesn't contain
' the character "E"; the function is case-sensitive.
PRINT INSTR("Hello World", "E")

The overloaded version of INSTR takes three parameters: the starting offset, the main string, and the sub-string:

' Returns the position of the first "o" in "Hello World". This
' time we explicitly provide the starting position for the
' search. The result is still 5.
PRINT INSTR(1, "Hello World", "o")

' Returns the position of the first "o" in "Hello World",
' starting from the 6th character of the string. This time, the
' function returns 8.
PRINT INSTR(6, "Hello World", "o")

' As a quick note: looking for an empty string always returns
' the starting offset, unless the starting offset is out of
' bounds. Thus, this call right here returns 3.
PRINT INSTR(3, "Hello World", "")
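
Two more quick checks, as a minimal sketch that only uses built-in functions: since INSTR is case-sensitive, folding both strings with UCASE$ is one simple way to get a case-insensitive search, and a starting offset located past the end of the string simply yields 0.

```basic
' Case-insensitive search: fold both strings to upper case
' before calling INSTR. "E" now matches the "e" of "Hello".
PRINT INSTR(UCASE$("Hello World"), UCASE$("E"))   ' prints 2

' The starting offset is beyond the end of the string, so
' nothing can match and the function returns 0.
PRINT INSTR(20, "Hello World", "o")               ' prints 0
```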

Reverse in-string

INSTR is usually invoked when splitting a string in two. For instance, we may extract the extension off a filename string with the following code:

CONST cChrQuote = 34

DIM sFileName AS STRING, sName AS STRING, sExt AS STRING
DIM iOffset AS INTEGER

' Here's the full filename.
sFileName = "DUKE3D.EXE"

' Look for the first occurrence of "." in {sFileName}. If there
' is none, iOffset is set to 0 and LEFT$() would be asked for
' -1 characters, which crashes the call; in that case, we set
' iOffset to the length of the string plus 1, and let LEFT$()
' absorb the whole string. Note that MID$() accepts an out of
' bounds offset (as long as it is greater than 0.)
iOffset = INSTR(sFileName, ".")
IF (iOffset = 0) THEN
  iOffset = LEN(sFileName) + 1
END IF

' Split Name and Extension.
sName = LEFT$(sFileName, iOffset - 1)
sExt = MID$(sFileName, iOffset + 1)

' Display.
PRINT "Name:"; CHR$(cChrQuote); sName; CHR$(cChrQuote)
PRINT " Ext:"; CHR$(cChrQuote); sExt; CHR$(cChrQuote)

Note that the code above works if sFileName is a proper DOS file name, but it splits at the wrong place if the name contains more than one dot, because INSTR returns the offset of the first occurrence of the searched string from a given location. Since we DO want the last occurrence of a dot in the string, we should be using a reverse in-string function called INSTRREV instead, which reads the source string right to left.

While it's a fairly common function that exists in FreeBASIC, Visual Basic and other languages too (indexOf() and lastIndexOf() in JavaScript for example,) INSTRREV is not available in QuickBASIC. Let's write a replacement for it:

DECLARE FUNCTION INSTRREV% (iStart AS INTEGER, sSource AS STRING, sFind AS STRING)

CONST cChrQuote = 34

DIM sFileName AS STRING, sName AS STRING, sExt AS STRING
DIM iOffset AS INTEGER

' A name with more than one dot, to exercise the reverse search.
sFileName = "DUKE3D.OLD.EXE"

' Look for the last occurrence of "." in {sFileName}.
iOffset = INSTRREV%(LEN(sFileName), sFileName, ".")
IF (iOffset = 0) THEN
  iOffset = LEN(sFileName) + 1
END IF

' Split Name and Extension.
sName = LEFT$(sFileName, iOffset - 1)
sExt = MID$(sFileName, iOffset + 1)

' Display.
PRINT "Name:"; CHR$(cChrQuote); sName; CHR$(cChrQuote)
PRINT " Ext:"; CHR$(cChrQuote); sExt; CHR$(cChrQuote)

''
'' Parse the source string {sSource} from right to left and
'' return the location of the first byte of {sFind} if the
'' sub-string exists; if it doesn't, the function returns 0.
'' The starting offset is optional for INSTR, but unfortunately
'' we cannot overload functions in QuickBASIC: {iStart} must
'' always be provided. Set it to a value that is at least
'' LEN({sSource}) to search the whole string.
''
FUNCTION INSTRREV% (iStart AS INTEGER, sSource AS STRING, sFind AS STRING)
  DIM iFindLen AS INTEGER, iOffset AS INTEGER, iFrom AS INTEGER

  ' Get the length of the searched string, in bytes.
  iFindLen = LEN(sFind)

  ' Copy {iStart} to {iFrom} because we may have to adjust the
  ' offset; in QuickBASIC, arguments are passed ByRef, meaning
  ' that if we modify {iStart} now, its value will be updated
  ' when the function exits!
  iFrom = iStart

  ' The starting offset is "too early" (too far right in the
  ' source string for {sFind} to fit;) adjust the offset so it
  ' makes sense. Note that if {sFind} is longer than {sSource},
  ' the starting offset will be zero or negative!
  IF ((iFrom + iFindLen - 1) > LEN(sSource)) THEN
    iFrom = LEN(sSource) - (iFindLen - 1)
  END IF

  ' The starting offset must be at least 1 to have a fair chance
  ' at a match; anything lower is "too late" (too far left of
  ' the source string.)
  IF (iFrom > 0) THEN

    ' Start reading from the given position all the way to the
    ' beginning of the string. Return the starting position of
    ' {sFind} in {sSource} as soon as we have a match.
    FOR iOffset = iFrom TO 1 STEP -1
      IF (MID$(sSource, iOffset, iFindLen) = sFind) THEN
        INSTRREV% = iOffset
        EXIT FUNCTION
      END IF
    NEXT iOffset

  END IF

  ' {sFind} is not in {sSource}.
  INSTRREV% = 0
END FUNCTION

Sub-string count

Another possible application of INSTR is to count the number of times a sub-string appears in a source string. Have you ever wondered how many times the letter "i" appears in "Supercalifragilisticexpialidocious"? Me neither, but let's find out!

DECLARE FUNCTION STRCOUNT% (sSource AS STRING, sFind AS STRING)

CONST cChrQuote = 34

DIM sText AS STRING, sLetter AS STRING

' Shift both strings to uppercase; if we don't, the program will
' consider "i" and "I" to be different letters and we really
' don't want that.
sText = UCASE$("Supercalifragilisticexpialidocious")
sLetter = UCASE$("i")

' And here's the result.
PRINT "The sub-string "; CHR$(cChrQuote); sLetter; CHR$(cChrQuote); " appears";
PRINT STRCOUNT%(sText, sLetter); "times in "; CHR$(cChrQuote); sText; CHR$(cChrQuote)

''
'' Count the number of times {sFind} appears in {sSource}. The
'' function is case-sensitive.
''
FUNCTION STRCOUNT% (sSource AS STRING, sFind AS STRING)
  DIM iOffset AS INTEGER, iCount AS INTEGER
  DIM iLength AS INTEGER

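  ' Get the length of the searched string; we're going to need
  ' that info in a loop, better cache it now.
  iLength = LEN(sFind)

  ' An empty {sFind} matches at the starting offset and never
  ' moves it forward, so the loop below would never end; report
  ' no occurrence instead.
  IF (iLength = 0) THEN
    STRCOUNT% = 0
    EXIT FUNCTION
  END IF

  ' Get the location of the first occurrence of {sFind}; if
  ' there's none, iOffset is 0 and the loop doesn't trigger.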
  iOffset = INSTR(sSource, sFind)
  DO WHILE (iOffset)

    ' We got a match, increment counter.
    iCount = iCount + 1

    ' We got the offset of the current occurrence. What we need
    ' to do now, is look for another occurrence AFTER the current
    ' offset; that's why we increment iOffset in the INSTR call.
    ' Note that we use iLength instead of simply adding 1
    ' because we may search for a group of letters rather than
    ' just one. If INSTR fails to find anything, iOffset will be
    ' set to 0 and the loop will terminate.
    iOffset = INSTR(iOffset + iLength, sSource, sFind)

  LOOP

  ' Return count.
  STRCOUNT% = iCount
END FUNCTION
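
A side effect of skipping iLength bytes after each match is that overlapping occurrences are not counted. If they matter, stepping a single byte past each match is one way to catch them. Here's a minimal inline sketch (the test strings are made up for the occasion, they're not from the example above):

```basic
DIM sSource AS STRING, sFind AS STRING
DIM iOffset AS INTEGER, iCount AS INTEGER

sSource = "aaaa"
sFind = "aa"

' Overlapping count: resume the search one byte after the start
' of the current match instead of LEN(sFind) bytes after it.
iOffset = INSTR(sSource, sFind)
DO WHILE (iOffset)
  iCount = iCount + 1
  iOffset = INSTR(iOffset + 1, sSource, sFind)
LOOP

' Matches start at offsets 1, 2 and 3.
PRINT iCount   ' prints 3
```

With the iLength step used by STRCOUNT%, the same strings would give 2 (matches at offsets 1 and 3 only.)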

String splitter

If we mix the functionality of the name and extension splitter with the design of the sub-string counter, we can write a pretty dope function to split a string when a certain marker is found. For instance, we may have a string that contains multiple fields, and each field is separated by tabs (ASCII value 9.) Let's write that:

DECLARE FUNCTION STRSPLIT$ (sSource AS STRING, sMarker AS STRING)

CONST cChrTab = 9
CONST cChrQuote = 34

DIM sSource AS STRING
DIM sToken AS STRING
DIM sSep AS STRING

' Set separator.
sSep = CHR$(cChrTab)

' Four-field test string.
sSource = "1st" + sSep + "2nd" + sSep + "3rd" + sSep + "4th"

' Initialize splitter, get first token if any.
sToken = STRSPLIT$(sSource, sSep)

' Keep going until we get an empty token.
DO WHILE LEN(sToken)
  PRINT CHR$(cChrQuote); sToken; CHR$(cChrQuote)
  sToken = STRSPLIT$("", "")
LOOP

''
'' Return individual tokens found in {sSource} and separated by
'' {sMarker}. When all tokens are parsed, the function returns
'' an empty string.
''
FUNCTION STRSPLIT$ (sSource AS STRING, sMarker AS STRING) STATIC
  DIM iOffset AS INTEGER, iLength AS INTEGER
  DIM sMessage AS STRING, iEnding AS INTEGER
  DIM sSplitAt AS STRING

  ' If {sSource} is not empty, we have a brand new string. Reset
  ' the reading offset, get the string length, and remember the
  ' splitting marker.
  IF (LEN(sSource)) THEN
    sMessage = sSource
    iOffset = 1
    iLength = LEN(sMessage)
    sSplitAt = sMarker
  END IF

  ' The offset is beyond the string size, or the source string
  ' is empty (the program would crash if {sSource} is nothing
  ' when STRSPLIT$() is called for the very first time.) Return
  ' nothing right away.
  IF ((iOffset > iLength) OR (iLength = 0)) THEN
    STRSPLIT$ = ""
    EXIT FUNCTION
  END IF

  ' Get the next separator marker; if there's none, take the
  ' remainder of the string.
  iEnding = INSTR(iOffset, sMessage, sSplitAt)
  IF (iEnding = 0) THEN
    iEnding = iLength + 1
  END IF

  ' The offset and the end marker are identical. This may happen
  ' if a field is empty. However, returning an empty string
  ' would likely force the calling routine to terminate the
  ' processing of the string, which would be a mistake. In that
  ' special case, return a single space.
  IF (iEnding = iOffset) THEN
    STRSPLIT$ = " "
  ' Working as intended.
  ELSE
    STRSPLIT$ = MID$(sMessage, iOffset, iEnding - iOffset)
  END IF

  ' Advance reading offset for next call.
  iOffset = iEnding + LEN(sSplitAt)
END FUNCTION

The function above is very useful but far from perfect. For instance, it's not possible to process tokens from two different strings at the same time: the function would either reset the reading offset continuously, or confuse the reading offset of one string with the reading offset of the other. STATIC is at fault here; consider passing the starting offset as an argument and updating that value before the function terminates.
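
The suggestion about passing the offset around could look something like the sketch below. NEXTTOKEN$ is a hypothetical name, and using an offset of 0 to signal "no more tokens" is my own assumption, not part of STRSPLIT$. Since QuickBASIC passes arguments ByRef, the caller's offset variable is updated on each call, and each string gets its own reading position.

```basic
DECLARE FUNCTION NEXTTOKEN$ (sSource AS STRING, sMarker AS STRING, iOffset AS INTEGER)

DIM sA AS STRING, sB AS STRING
DIM iPosA AS INTEGER, iPosB AS INTEGER

sA = "1,2,3"
sB = "x;y;z"
iPosA = 1
iPosB = 1

' Interleave tokens from both strings; each one keeps its own
' reading offset. Prints "1 x", "2 y" and "3 z".
DO WHILE (iPosA > 0) AND (iPosB > 0)
  PRINT NEXTTOKEN$(sA, ",", iPosA); " "; NEXTTOKEN$(sB, ";", iPosB)
LOOP

''
'' Return the token of {sSource} that starts at {iOffset}, then
'' move {iOffset} past the next {sMarker}. {iOffset} is set to
'' 0 once the last token has been returned (assumed convention.)
''
FUNCTION NEXTTOKEN$ (sSource AS STRING, sMarker AS STRING, iOffset AS INTEGER)
  DIM iEnding AS INTEGER

  ' Nothing left to read, or nothing sensible to search for.
  IF (iOffset < 1) OR (iOffset > LEN(sSource) + 1) OR (LEN(sMarker) = 0) THEN
    iOffset = 0
    NEXTTOKEN$ = ""
    EXIT FUNCTION
  END IF

  iEnding = INSTR(iOffset, sSource, sMarker)
  IF (iEnding = 0) THEN
    ' Last token: take the remainder and flag the end.
    NEXTTOKEN$ = MID$(sSource, iOffset)
    iOffset = 0
  ELSE
    NEXTTOKEN$ = MID$(sSource, iOffset, iEnding - iOffset)
    iOffset = iEnding + LEN(sMarker)
  END IF
END FUNCTION
```

Because the end of the string is reported through the offset rather than through an empty token, an empty field can be returned as a genuinely empty string.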

Also, the function is designed to return an empty string when all tokens have been parsed, but there's a special case in which a field may be empty (which is not an input error) and that would force the function to return nothing when in reality, more tokens may still be present. To avoid the issue, STRSPLIT$() returns a single space... it gets the job done, but it's a dirty quirk.

Finally (and this is a very contextual sort of issue,) the function will look for a given separator anywhere in the source string, which may be a problem if we actually do not want to consider the marker as a separator. For instance, we may use a comma as a separator for tokens, but we would also like some tokens to be able to contain a comma as a plain text symbol. Usually, this is processed by either detecting "escaped characters" (characters preceded by a backslash) or non-breaking tokens (placed between quotes.)
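
For the "placed between quotes" variant, one possible approach is a quote-aware replacement for INSTR. INSTRQ% below is a hypothetical helper written for illustration; it assumes the search starts outside of any quoted section and that double quotes are never escaped.

```basic
DECLARE FUNCTION INSTRQ% (iStart AS INTEGER, sSource AS STRING, sMarker AS STRING)

' The comma inside the quoted field is ignored; the first
' "real" separator is at offset 14.
PRINT INSTRQ%(1, CHR$(34) + "Smith, John" + CHR$(34) + ",42", ",")

''
'' Same idea as INSTR, but any {sMarker} located between double
'' quotes is skipped. Returns 0 if there is no match.
''
FUNCTION INSTRQ% (iStart AS INTEGER, sSource AS STRING, sMarker AS STRING)
  DIM iOffset AS INTEGER, iQuoted AS INTEGER

  INSTRQ% = 0
  IF (iStart < 1) OR (LEN(sMarker) = 0) THEN EXIT FUNCTION

  FOR iOffset = iStart TO LEN(sSource)
    IF (MID$(sSource, iOffset, 1) = CHR$(34)) THEN
      ' Toggle the "inside quotes" flag (0 or -1.)
      iQuoted = NOT iQuoted
    ELSEIF (iQuoted = 0) THEN
      IF (MID$(sSource, iOffset, LEN(sMarker)) = sMarker) THEN
        INSTRQ% = iOffset
        EXIT FUNCTION
      END IF
    END IF
  NEXT iOffset
END FUNCTION
```

A splitter could call such a function in place of INSTR when looking for the next separator; stripping the quotes off the returned token is left out of this sketch.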

Another thing we could do (and this is more of a wish list thing than an actual shortcoming) is to write a reverse split, which would split tokens backward (starting from the right end of the string.) It's again a very niche tool, but I needed something similar not too long ago for my job. I don't want to get into details, but the run-down is basically: here's a string that contains a fixed number of space-separated fields. The catch is that, at two locations in the string, there are fields that contain spaces and need to be stitched together. Of course, those fields are not embedded in quotes, because where's the fun in THAT?

' ID is always an integer, MESSAGE may contain spaces, DATE is
' always YYYY-MM-DD, AMOUNT may contain spaces, SIGN is always
' "+" or "-"...
sSource = "143 TRANSACTION FROM ACCOUNT 01.234.124 2007-12-01 1 150 +"

The above string had to produce the following output:

sOrderNum = "143"
sOrderMsg = "TRANSACTION FROM ACCOUNT 01.234.124"
sOrderDte = "2007-12-01"
sOrderCur = "1,150"
sOrderSgn = "+"

The solution was to first read one token on the left end of the string (sOrderNum) and place a marker. Then read one token from the right (sOrderSgn.) Keep reading more tokens from the right and concatenate them (sOrderCur) until the token matches a date format (sOrderDte.) Place a marker. Now extract the string between the two markers (sOrderMsg)... it's a lot of work to parse a document that should have been XML, CSV, or plain fixed-length fields in the first place. You have to be creative sometimes.
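
Here's a rough sketch of those steps, not the code I actually used on the job. LASTSPACE% is a hypothetical helper, the date test only checks the length and the position of the two dashes, and joining the amount pieces with a comma is an assumption made to match the expected output above. It also assumes a well-formed line (at least an ID, a date and a sign.)

```basic
DECLARE FUNCTION LASTSPACE% (sText AS STRING)

DIM sSource AS STRING, sRest AS STRING, sToken AS STRING
DIM sOrderNum AS STRING, sOrderMsg AS STRING, sOrderDte AS STRING
DIM sOrderCur AS STRING, sOrderSgn AS STRING
DIM iOffset AS INTEGER

sSource = "143 TRANSACTION FROM ACCOUNT 01.234.124 2007-12-01 1 150 +"

' Left end: the ID is everything before the first space.
iOffset = INSTR(sSource, " ")
sOrderNum = LEFT$(sSource, iOffset - 1)
sRest = MID$(sSource, iOffset + 1)

' Right end: the sign is the very last token.
iOffset = LASTSPACE%(sRest)
sOrderSgn = MID$(sRest, iOffset + 1)
sRest = LEFT$(sRest, iOffset - 1)

' Keep peeling tokens off the right until one looks like a date.
DO WHILE LEN(sRest)
  iOffset = LASTSPACE%(sRest)
  sToken = MID$(sRest, iOffset + 1)
  IF (iOffset > 0) THEN sRest = LEFT$(sRest, iOffset - 1) ELSE sRest = ""

  IF (LEN(sToken) = 10) AND (MID$(sToken, 5, 1) = "-") AND (MID$(sToken, 8, 1) = "-") THEN
    sOrderDte = sToken
    EXIT DO
  END IF

  ' Amount pieces arrive right to left, so prepend each one.
  IF LEN(sOrderCur) THEN sOrderCur = sToken + "," + sOrderCur ELSE sOrderCur = sToken
LOOP

' Whatever is left between the two markers is the message.
sOrderMsg = sRest

PRINT sOrderNum: PRINT sOrderMsg: PRINT sOrderDte
PRINT sOrderCur: PRINT sOrderSgn

''
'' Return the offset of the last space in {sText}, 0 if none.
''
FUNCTION LASTSPACE% (sText AS STRING)
  DIM iOffset AS INTEGER

  FOR iOffset = LEN(sText) TO 1 STEP -1
    IF (MID$(sText, iOffset, 1) = " ") THEN
      LASTSPACE% = iOffset
      EXIT FUNCTION
    END IF
  NEXT iOffset

  LASTSPACE% = 0
END FUNCTION
```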

Sub-string replacement

Another cool function we may devise from INSTR replaces all occurrences of a sub-string inside a source string. It's super simple and useful:

DECLARE FUNCTION REPLACE$ (sSource AS STRING, sOld AS STRING, sNew AS STRING)

DIM sSource AS STRING, sSeek AS STRING, sReplacement AS STRING

' Initialize the source string, the current sub-string we want
' to replace and the new content for the sub-string.
sSource = "How much wood would a woodsneed sneed if a woodsneed could sneed wood?"
sSeek = "sneed"
sReplacement = "chuck"

' For those of you that don't understand the joke. The sign is a
' subtle joke. The shop is called "Sneed's Feed & Seed", where
' feed and seed both end in the sound "-eed", thus rhyming with
' the name of the owner, Sneed. The sign says that the shop was
' "Formerly Chuck's", implying that the two words beginning with
' "F" and "S" would have ended with "-uck", rhyming with
' "Chuck". So, when Chuck owned the shop, it would have been
' called something I cannot really use to illustrate what one
' would call "Clean Safe Fun" QuickBASIC code.
PRINT REPLACE$(sSource, sSeek, sReplacement)

''
'' Return a version of {sSource} where all occurrences of {sOld}
'' have been replaced by {sNew}.
''
FUNCTION REPLACE$ (sSource AS STRING, sOld AS STRING, sNew AS STRING)
  DIM sOutput AS STRING, iOffset AS INTEGER
  DIM iLenNew AS INTEGER, iLenOld AS INTEGER

  ' Get the length, in bytes, of the old and new strings. We're
  ' going to need those values quite often.
  iLenOld = LEN(sOld)
  iLenNew = LEN(sNew)

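  ' Copy the source string to the output.
  sOutput = sSource

  ' An empty {sOld} would match over and over and the loop below
  ' would never end; return the source string untouched.
  IF (iLenOld = 0) THEN
    REPLACE$ = sOutput
    EXIT FUNCTION
  END IF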

  ' Look for {sOld} in the source.
  iOffset = INSTR(sOutput, sOld)
  DO WHILE (iOffset)
    ' We got a match. Take every character located before the
    ' match, write the new string, then take every character
    ' located after the match (we skip the length of the old
    ' string in the process.)
    sOutput = LEFT$(sOutput, iOffset - 1) + sNew + MID$(sOutput, iOffset + iLenOld)

    ' Find the next occurrence of {sOld} after the split offset
    ' plus the length of the replacement string (so that we do
    ' not fall into an endless loop if {sOld} is part of {sNew})
    iOffset = INSTR(iOffset + iLenNew, sOutput, sOld)
  LOOP

  ' Return the output string.
  REPLACE$ = sOutput
END FUNCTION

Quick input evaluation

And for our final trick of the night: a less obvious use of INSTR... in a simple command-line program, the user may be invited to select options by the press of a single key. For instance, "The file already exists, replace? [Y/N/A]" In this case, the program prompts the user about an existing file that is going to be replaced. The user may choose "Yes" (to replace the file,) "No" (to leave the file as-is) or "Always" (to always replace existing files if such a situation is encountered again.) The straightforward way of handling such a case would be:

DIM sPrompt AS STRING

DO
  ' Ask the user what to do.
  PRINT "The file already exists, replace? [Y/N/A]"

  ' Loop until the user presses a key (we check against 1 to
  ' ignore special inputs like arrow keys and function keys)
  DO: sPrompt = INKEY$: LOOP UNTIL (LEN(sPrompt) = 1)

  ' Force the key to upper case and check.
  sPrompt = UCASE$(sPrompt)
  IF (sPrompt = "Y") THEN
    PRINT "Overwrite"
    EXIT DO
  ELSEIF (sPrompt = "N") THEN
    PRINT "Skip"
    EXIT DO
  ELSEIF (sPrompt = "A") THEN
    PRINT "Always overwrite"
    EXIT DO
  END IF

  ' If the user pressed another key, ask again.
  PRINT "Let's try that again."
LOOP

This is probably nitpicking for most people, but string evaluations are slower than integer evaluations, and every time a string is mentioned (for evaluation or used as an argument in routines or functions,) the compiler will produce a truckload of code, far more than when we use integers. It is possible to mitigate the issue with INSTR:

DIM sPrompt AS STRING

DO
  ' Ask the user what to do.
  PRINT "The file already exists, replace? [Y/N/A]"

  ' Loop until the user presses a key (we check against 1 to
  ' ignore special inputs like arrow keys and function keys)
  DO: sPrompt = INKEY$: LOOP UNTIL (LEN(sPrompt) = 1)

  ' Get the offset of sPrompt in "YyNnAa"; if the user pressed
  ' one of these keys, the output will be a value between 1 and
  ' 6. If we increment the value by one and proceed with an
  ' integer division by 2, we obtain a value between 1 and 3.
  ' If the user pressed another key, the result will be 0 (0
  ' plus 1 equals 1, and 1 divided by 2 as integers equals 0.)
  SELECT CASE ((INSTR("YyNnAa", sPrompt) + 1) \ 2)
  CASE 1
    PRINT "Overwrite"
    EXIT DO
  CASE 2
    PRINT "Skip"
    EXIT DO
  CASE 3
    PRINT "Always overwrite"
    EXIT DO
  END SELECT

  ' If the user pressed another key, ask again.
  PRINT "Let's try that again."
LOOP