'=========================================================================== ' Subject: BOYER-MOORE INSTR Date: 12-05-95 (00:00) ' Author: Jamshid Khoshrangi Code: PB ' Origin: qjackson@direct.ca Packet: PB.ABC '=========================================================================== $IF 0 BOYER.BAS BOYER.BAS Boyer-Moore INSTR Translated from Pure MASM to PB Inline Assembler by Quinn Tyler Jackson (aka "Jamshid") qjackson@direct.ca Copyright 1994 by Michael Abrash Modifications Copyright 1995 by AhuraMazda(tm) Software (Distributed with Abrash's permission.) NOTES: I have a great book in my library of reference works called _Zen of Code Optimization_ by Michael Abrash (Coriolis Group Books, AZ, 1994). Chapter 14 deals entirely with what is known as the Boyer-Moore String Search. Abrash includes with his book the ASM code for a C-linkable, small memory model Boyer-Moore string search. I decided, in my folly, to do my best to translate Abrash's crowning achievement into PowerBASIC inline assembler. It wasn't an easy task. First of all, I am not an assembler wizard. I can fumble my way around well-commented ASM, but it's not my native tongue, to say the least. Abrash's code was written for the small model, which assumes that CS, ES, and SS are all one and the same segment. Although this makes Abrash's code elegant, it also made it a nightmare for me to rework for a CS, ES, and SS that could very well reside in three very distant segments. Let's just say that this little bit of conversion took me about three days of steady study, an open MASM 6.1 reference manual, and a lot of hot chocolate, but I think I've managed.... Especially difficult and frustrating were PowerBASIC's constant FIXUP OVERFLOW errors during compilation. Put simply, BINSTR(), which stands for "Boyer INSTR" does what INSTR does, but in some cases, it does it better. Be that as it may, in some pathological cases, it does it worse. BINSTR is the hacker's INSTR, then. As long as the programmer has a far understanding of the kind of strings he will be employing, the speed gains with BINSTR are phenomenal. If the data stream is uncertain, INSTR is possibly the better generic choice. I can only warn the programmer: "Know Thy Data!" BOYER-MOORE STRING SEARCHES: Just about every decent book on computer science theory talks about the theoretically superlative "Boyer-Moore String Search" and leaves it at that. Just what is the beast? Consider the brute force method of finding a substring within a string. Essentially, we click ahead one byte at a time, comparing it to the first byte of our Pattern, until the byte matches. Then we compare the second byte of the Pattern with the next byte, until we fail. When we fail, we advance in the buffer one byte and continue this way until we hit another first byte match. Either we find what we're looking for, or we don't. Meanwhile, we waste a lot of cycles doing things over and over again. It makes intuitive sense that the assembler instructions REPNZ SCASB will be about the fastest search for a substring pattern in a buffer that there is. After all, how could 200 lines of assembler be FASTER than a simple REPNZ SCASB? Well, consider this: Suppose we're looking for the pattern "XYZ" in a buffer containing "XYAXYZ". We match on "X", we match on "Y", and we mismatch on "A". A close look at this tells us something: we don't have to do our next search starting at the second byte.... We can skip ahead to byte four in the buffer. Now consider something else: we wasted two hits to be disappointed by a miss at the end. What would happen if we compared the last byte of the pattern with the third byte of the buffer first? A mismatch of the rightmost character of a pattern tells us much more than a mismatch of the leftmost character. If the mismatched character in the buffer doesn't appear in the pattern, then we've not just eliminated ONE potential match, but as many potential matches as there are characters in the pattern. It gets quite technical from this point on. Read Abrash's book if you can't follow the code. What it all amounts to are these rules of thumb: The longer the pattern to match, the faster the search when using a Boyer-Moore engine. The nature of the data affects the speed of the search when using a Boyer-Moore engine. That said, searching for an "A" in a stream with only one "A" at the very end of it would prove slow using BINSTR. Searching for a 1K pattern in a large stream proves amazingly efficient when compared to PowerBASIC's INSTR. Again: "Know Thy Data!" FINAL WORD: I can take none of the credit for either the algorithm or its implementation. I took the time to provide a PowerBASIC 3.2 version of Abrash's code, that's all. Have fun with it. Jamshid $ENDIF $ERROR BOUNDS ON $DIM ALL DEFINT A-Z DECLARE FUNCTION BINSTR (_ TheStart AS INTEGER,_ Target AS STRING,_ Rule AS STRING) %DEBUG = -1 $IF %DEBUG DIM Rule AS STRING,_ Target AS STRING OPEN "PB.EXE" FOR BINARY ACCESS READ AS #1 GET$ #1, 1000 , Rule$ GET$ #1, 28000 , Target$ ' Tag some near random stuff to the ' beginning to make things interesting Target$ = Target$ + Rule$ CLOSE #1 CLS VIEW TEXT (10,7)-(75,20) PRINT "Boyer-Moore String Search Implementation for PowerBASIC 3.2" PRINT PRINT "Translated from Michael Abrash's code by Jamshid Khoshrangi" PRINT DIM n AS INTEGER MTIMER n = BINSTR(1, Target$, Rule$) PRINT ,,"Boyer INSTR: "; MTIMER; "microseconds" PRINT ,,"Found at offset: ";n PRINT MTIMER n= INSTR(Target$, Rule$) PRINT ,,"PB'S INSTR: ";MTIMER; "microseconds" PRINT ,,"Found at offset: "; n $ENDIF ' I've found that the Boyer-Moore method works best on patterns ' that are 5 or more bytes long. The BINSTR() FUNCTION will use ' a standard INSTR() on shorter patterns, incurring about 40 to ' 50 microseconds of overhead on my system. %SMALLEST.PATTERN = 2 %SMALLEST.BUFFER = 4 FUNCTION BINSTR (_ TheStart AS INTEGER,_ Target AS STRING,_ Rule AS STRING _ ) PUBLIC AS INTEGER ' ES:DI Target ' DS:SI Rule DIM PatLen AS INTEGER,_ BuffLen AS INTEGER DIM StackSeg AS WORD,_ PatSeg AS WORD,_ PatPtr AS WORD,_ BuffSeg AS WORD,_ BuffPtr AS WORD,_ TrueBuffPtr AS WORD PatLen = LEN(Rule) BuffLen = LEN(Target) IF BuffLen < %SMALLEST.BUFFER THEN FUNCTION = INSTR(TheStart, Target, Rule) EXIT FUNCTION ELSE IF PatLen < %SMALLEST.PATTERN THEN FUNCTION = INSTR(TheStart, Target, Rule) EXIT FUNCTION END IF END IF IF TheStart > BuffLen THEN ' we might as well not even bother EXIT FUNCTION END IF StackSeg = 0 ' We do this for later inline ASM PatSeg = STRSEG(Rule) PatPtr = STRPTR(Rule) BuffSeg = STRSEG(Target) BuffPtr = STRPTR(Target) + TheStart TrueBuffPtr = BuffPtr ! cld ! push sp ! push si ;preserve caller's register variables ! push di ! push es ! push ds ! sub sp, 256*2 ;allocate space for SkipTable on our ! ; stack ' Create the table of distances by which to skip ahead on mismatches ' for every possible byte value. First, initialize all skips to the ' pattern length; this is the skip distance for bytes that don't ' appear in the pattern. ! mov ax,PatLen ! and ax,ax ;return an instant match if the ! jz InstantMatch ; pattern 0-length ! mov StackSeg,ss ! mov es,StackSeg ! mov di,sp ;point to SkipBuffer ! mov cx,256 ! rep stosw ! mov es,BuffSeg ! mov ds,PatSeg ! dec ax ;from now on, we only need ! mov PatLen,ax ; PatLen - 1 ' Point to last (rightmost) byte of first potential pattern match ' location in buffer. ! add BuffPtr, ax ' Reject if buffer is too small, and set the count of the number of ' potential pattern match locations in the buffer. ! sub BuffLen,ax ! jbe NoMatch ' Set the skip values for the bytes that do appear in the pattern ' to the distance from the byte location to the end of the pattern. ' When there are multiple instances of the same byte, the rightmost ' instance's skip value is used. Note that the rightmost byte of the ' pattern isn't entered in the skip table; if we get that value for ' a mismatch, we know for sure that the right end of the pattern has ' already passed the mismatch location, so this is not a relevant ' byte for skipping purposes. ! mov si,PatPtr ;point to start of pattern ! and ax,ax ;are there any skips to set? ! jz SetSkipDone ;no ! mov di,sp ;point to SkipBuffer SetSkipLoop: ! sub bx,bx ;prepare for word addressing off byte value ! mov bl,ds:[si] ;get the next pattern byte ! inc si ;advance the pattern pointer ! shl bx,1 ;prepare for word look-up ! mov ss:[di+bx],ax ;set the skip value when this byte value ! ;is the mismatch value in the buffer ! dec ax ! jnz SetSkipLoop SetSkipDone: ! mov dl,ds:[si] ;DL=rightmost pattern byte from now on ! dec si ;point to next-to-rightmost byte of pattern ! mov PatPtr,si ; from now on ' Search the buffer. ! std ;for backward REPZ CMPSB ! mov di,BuffPtr ;point to the first search location ! mov cx,BuffLen ;# of match locations to check SearchLoop: ! mov si,sp ;point SI to SkipTable ' Skip through until there's a match for the rightmost pattern ' byte. ' We jump this way to avoid a FIXUP OVERFLOW under PB... ! jmp QuickSearchLoop InstantMatch: ! mov ax,BuffPtr ! jmp short Done ' Compare the pattern and the buffer location, searching from high ' memory toward low (right to left). FullCompare: ! mov BuffPtr,di ;save the current state of ! mov BuffLen,cx ; the search ! mov cx,PatLen ;# of bytes yet to compare ! jcxz Matched ;done if there was only one character ! mov si,PatPtr ;point to next-to-rightmost bytes ! dec di ; of buffer location and pattern ! repz cmpsb ;compare the rest of the pattern ! jz Matched ;that's it; we've found a match ' It's a mismatch; let's see what we can learn from it. ! inc di ;compensate for 1-byte overrun of REPZ CMPSB; ! ; point to mismatch location in buffer ! ; # of bytes that did match. ! mov si,BuffPtr ! sub si,di ' If, based on the mismatch character, we can't even skip ahead ' as far as where we started this particular comparison, then ' just advance by 1 to the next potential match; otherwise, skip ' ahead from this comparison location by the skip distance for ' the mismatch character, less the distance covered by the ' partial match. ! sub bx,bx ;prepare for word addressing off byte value ! mov bl,es:[di] ;get the value of the mismatch byte in buffer ! add bx,bx ;prepare for word look-up ! add bx,sp ;SP points to SkipTable ! mov cx,ss:[bx] ;get the skip value for this mismatch ! mov ax,1 ;assume we'll just advance to the next ! ; potential match location ! sub cx,si ;is the skip far enough to be worth taking? ! jna MoveAhead ;no, go with the default advance of 1 ! mov ax,cx ;yes; this is the distance to skip ahead from ! ; the last potential match location checked MoveAhead: ' Skip ahead and perform the next comparison, if there's any buffer ' left to check. ! mov di,BuffPtr ! add di,ax ;BuffPtr += Skip ! mov cx,BuffLen ! sub cx,ax ;BuffLen -= Skip; ! ja SearchLoop ;continue if any buffer left ' Return a NULL pointer for no match. NoMatch: ! sub ax,ax ! jmp short Done ' Return start of match in buffer (BuffPtr - (PatLen - 1)). Matched: ! mov ax, BuffPtr ! sub ax, TrueBuffPtr ! sub ax, PatLen ! les bx, [bp+&HE] ! add ax, es:[bx] ; Parameter: TheStart ! inc ax ! mov FUNCTION, ax Done: ! cld ! add sp,256*2 ;deallocate space for SkipTable ! pop ds ! pop es ! pop di ;restore caller's register variables ! pop si ! pop sp EXIT FUNCTION QuickSearchLoop: ! mov bl,es:[di] ;rightmost buffer byte at this location ! cmp dl,bl ;does it match the rightmost pattern byte? ! jz FullCompare ;yes, so keep going ! sub bh,bh ;convert to a word ! add bx,bx ;prepare for look-up in SkipTable ! mov ax,ss:[si+bx] ;get skip value from skip table for this ! ; mismatch value ! add di,ax ;BuffPtr += Skip; ! sub cx,ax ;BuffLen -= Skip; ! ja QuickSearchLoop;continue if any buffer left ! jmp short NoMatch ' Return a pointer to the start of the buffer (for 0-length ' pattern). END FUNCTION