This document explains Crochemore and Perrin's Two-Way string matching
algorithm, in which a smaller string (the "pattern" or "needle")
is searched for in a longer string (the "text" or "haystack"),
determining whether the needle is a substring of the haystack, and if
so, at what index(es). It is to be used by Python's string
(and bytes-like) objects when calling `find`, `index`, `__contains__`,
or implicitly in methods like `replace` or `partition`.

This is essentially a re-telling of the paper

    Crochemore M., Perrin D., 1991, Two-way string-matching,
        Journal of the ACM 38(3):651-675.

focused more on understanding and examples than on rigor. See also
the code sample here:

    http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260

The algorithm runs in O(len(needle) + len(haystack)) time and with
O(1) space. However, since there is a larger preprocessing cost than
simpler algorithms, this Two-Way algorithm is to be used only when the
needle and haystack lengths meet certain thresholds.


These are the basic steps of the algorithm:

    * "Very carefully" cut the needle in two.
    * For each alignment attempted:
        1. match the right part
            * On failure, jump by the amount matched + 1
        2. then match the left part.
            * On failure jump by max(len(left), len(right)) + 1
    * If the needle is periodic, don't re-do comparisons; maintain
      a "memory" of how many characters you already know match.


-------- Matching the right part --------

We first scan the right part of the needle to check if it matches
the aligned characters in the haystack. We scan left-to-right,
and if a mismatch occurs, we jump ahead by the amount matched plus 1.

Example:

       text:    ........EFGX...................
    pattern:    ....abcdEFGH....
        cut:        <<<<>>>>

Matched 3, so jump ahead by 4:

       text:    ........EFGX...................
    pattern:        ....abcdEFGH....
        cut:            <<<<>>>>

Why are we allowed to do this? Because we cut the needle very
carefully, in such a way that if the cut is ...abcd + EFGH... then
we have

        d != E
       cd != EF
      bcd != EFG
     abcd != EFGH
          ... and so on.

If this is true for every pair of equal-length substrings around the
cut, then the following alignments do not work, so we can skip them:

       text:    ........EFG....................
    pattern:     ....abcdEFGH....
                        ^   (Bad because d != E)
       text:    ........EFG....................
    pattern:      ....abcdEFGH....
                        ^^   (Bad because cd != EF)
       text:    ........EFG....................
    pattern:       ....abcdEFGH....
                        ^^^   (Bad because bcd != EFG)

Skip 3 alignments => increment alignment by 4.


-------- If len(left_part) < len(right_part) --------

Above is the core idea, and it begins to suggest how the algorithm can
be linear-time. There is one bit of subtlety involving what to do
around the end of the needle: if the left half is shorter than the
right, then we could run into something like this:

       text:    .....EFG......
    pattern:       cdEFGH

The same argument holds that we can skip ahead by 4, so long as

       d != E
      cd != EF
     ?cd != EFG
    ??cd != EFGH
         etc.

The question marks represent "wildcards" that always match; they're
outside the limits of the needle, so there's no way for them to
invalidate a match. To ensure that the inequalities above are always
true, we need them to be true for all possible '?' values. We thus
need cd != FG and cd != GH, etc.
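The wildcard comparison can be sketched in a few lines of Python.
This is only an illustration, not code from fastsearch.h, and the
helper name chunks_clash is made up here. It reports whether the
length-k chunks on either side of a cut disagree at some position
where both chunks lie inside the needle:

```python
def chunks_clash(needle, cut, k):
    """Return True if needle[cut-k:cut] != needle[cut:cut+k] for every
    way of filling in the out-of-range '?' positions.

    Hypothetical helper, for illustration only.
    """
    for j in range(k):
        left, right = cut - k + j, cut + j
        # Positions outside the needle are wildcards and never clash.
        if left >= 0 and right < len(needle):
            if needle[left] != needle[right]:
                return True
    return False


needle, cut = "cdEFGH", 2
print([k for k in range(1, len(needle) + 1) if chunks_clash(needle, cut, k)])
# prints [1, 2, 3, 4, 5]
```

For k = 6 no in-range positions overlap any more, so nothing can
clash at that length.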


-------- Matching the left part --------

Once we have ensured the right part matches, we scan the left part
(order doesn't matter, but traditionally right-to-left), and if we
find a mismatch, we jump ahead by
max(len(left_part), len(right_part)) + 1. That we can jump by
at least len(right_part) + 1 we have already seen:

       text: .....EFG.....
    pattern:  abcdEFG
    Matched 3, so jump by 4,
    using the fact that d != E, cd != EF, and bcd != EFG.

But we can also jump by at least len(left_part) + 1:

       text: ....cdEF.....
    pattern:   abcdEF
    Jump by len('abcd') + 1 = 5.

    Skip the alignments:
       text: ....cdEF.....
    pattern:    abcdEF
       text: ....cdEF.....
    pattern:     abcdEF
       text: ....cdEF.....
    pattern:      abcdEF
       text: ....cdEF.....
    pattern:       abcdEF

This requires the following facts:
       d != E
      cd != EF
     bcd != EF?
    abcd != EF??
         etc., for all values of ?s, as above.

If we have both sets of inequalities, then we can indeed jump by
max(len(left_part), len(right_part)) + 1. Under the assumption of such
a nice splitting of the needle, we now have enough to prove linear
time for the search: consider the forward-progress/comparisons ratio
at each alignment position. If a mismatch occurs in the right part,
the ratio is 1 position forward per comparison. On the other hand,
if a mismatch occurs in the left half, we advance by more than
len(needle)//2 positions for at most len(needle) comparisons,
so this ratio is more than 1/2. This average "movement speed" is
bounded below by the constant "1 position per 2 comparisons", so we
have linear time.
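Putting the two scans together, the search described so far might be
sketched as below. This is a simplified illustration rather than the
implementation in fastsearch.h: the function name is made up, and the
caller is assumed to pass a cut for which all of the inequalities
above hold (the function does not check this).

```python
def find_long_period(needle, haystack, cut):
    """Two-Way search without memory.

    Assumes (not checked) that `cut` provides the inequalities for all
    lengths up to max(cut, len(needle) - cut).
    """
    m, n = len(needle), len(haystack)
    left_fail_shift = max(cut, m - cut) + 1
    pos = 0
    while pos + m <= n:
        # 1. Match the right part, left-to-right.
        i = cut
        while i < m and needle[i] == haystack[pos + i]:
            i += 1
        if i < m:
            pos += i - cut + 1  # amount matched + 1
            continue
        # 2. Match the left part, right-to-left.
        j = cut - 1
        while j >= 0 and needle[j] == haystack[pos + j]:
            j -= 1
        if j < 0:
            return pos
        pos += left_fail_shift
    return -1


print(find_long_period("abcdEFGH", "..abcdEFGX.abcdEFGH..", 4))  # prints 11
```

All characters of "abcdEFGH" are distinct, so any cut satisfies the
assumption for this needle.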


-------- The periodic case --------

The sets of inequalities listed so far seem too good to be true in
the general case. Indeed, they fail when a needle is periodic:
there's no way to split 'AAbAAbAAbA' in two such that

    (the stuff n characters to the left of the split)
    cannot equal
    (the stuff n characters to the right of the split)
    for all n.

This is because no matter how you cut it, you'll get
s[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the
needle in two so that n can be as big as possible. If we were to
split it as

    AAbA + AbAAbA

then A == A at the split, so this is bad (we failed at length 1), but
if we split it as

    AA + bAAbAAbA

we at least have A != b and AA != bA, and we fail at length 3
since ?AA == bAA. We already knew that a cut to make length-3
mismatch was impossible due to the period, but we now see that the
bound is sharp; we can get length-1 and length-2 to mismatch.

This is exactly the content of the *critical factorization theorem*:
that no matter the period of the original needle, you can cut it in
such a way that (with the appropriate question marks),
needle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period.

Even "non-periodic" strings are periodic with a period equal to
their length, so for such needles, the CFT already guarantees that
the algorithm described so far will work, since we can cut the needle
so that the length-k chunks on either side of the cut mismatch for all
k < len(needle). Looking closer at the algorithm, we only actually
require that k go up to max(len(left_part), len(right_part)).
So long as the period exceeds that, we're good.
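The theorem is easy to check by brute force on small inputs. The
sketch below uses two made-up helpers, period and local_period (the
latter being the first chunk length at which the two sides of a cut
stop mismatching), and confirms that some cut always reaches the full
period. It is a sanity check on short strings, not a proof.

```python
from itertools import product


def period(s):
    """Smallest p >= 1 with s[i] == s[i + p] wherever both indices exist."""
    return next(p for p in range(1, len(s) + 1)
                if all(s[i] == s[i + p] for i in range(len(s) - p)))


def local_period(s, cut):
    """Smallest k >= 1 for which the length-k chunks around `cut` agree,
    treating positions outside s as wildcards."""
    k = 1
    while any(s[i] != s[i + k]
              for i in range(max(0, cut - k), min(cut, len(s) - k))):
        k += 1
    return k


s = "AAbAAbAAbA"
print(period(s), [local_period(s, cut) for cut in range(1, len(s))])
# prints 3 [1, 3, 3, 1, 3, 3, 1, 3, 2]

# Brute-force check for all strings of length 2..8 over {A, b}.
for length in range(2, 9):
    for chars in product("Ab", repeat=length):
        t = "".join(chars)
        assert max(local_period(t, c) for c in range(1, length)) == period(t)
```

The cut AA + bAAbAAbA corresponds to cut = 2, the first position in
the list whose value equals the period.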

The more general shorter-period case is a bit harder. The essentials
are the same, except we use the periodicity to our advantage by
"remembering" periods that we've already compared. In our running
example, say we're computing

    "AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA".

We cut as AA + bAAbAAbA, and then the algorithm runs as follows:

    First alignment:
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
    AAbAAbAAbA
      ^^X
    - Mismatch at third position, so jump by 3.
    - This requires that A != b and AA != bA.

    Second alignment:
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
       AAbAAbAAbA
         ^^^^^^^^
        X
    - Matched entire right part
    - Mismatch at left part.
    - Jump forward a period, remembering the existing comparisons

    Third alignment:
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
          AAbAAbAAbA
          mmmmmmm^^X
    - There's "memory": a bunch of characters were already matched.
    - Two more characters match beyond that.
    - The 8th character of the right part mismatched, so jump by 8
    - The above rule is more complicated than usual: we don't have
      the right inequalities for lengths 1 through 7, but we do have
      shifted copies of the length-1 and length-2 inequalities,
      along with knowledge of the mismatch. We can skip all of these
      alignments at once:

        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
               AAbAAbAAbA
                ~                   A != b at the cut
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                AAbAAbAAbA
                ~~                  AA != bA at the cut
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                 AAbAAbAAbA
                   ^^^^X            7-3=4 match, and the 5th misses.
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                  AAbAAbAAbA
                   ~                A != b at the cut
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                   AAbAAbAAbA
                   ~~               AA != bA at the cut
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                    AAbAAbAAbA
                      ^X            7-3-3=1 match and the 2nd misses.
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                     AAbAAbAAbA
                      ~             A != b at the cut

    Fourth alignment:
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                  AAbAAbAAbA
                    ^^^^^^^^
                   X
    - The memory was reset by the last mismatch, so the whole right
      part is compared, and it matches.
    - Mismatch at left part, so use memory and skip ahead by period=3

    Fifth alignment:
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
                     AAbAAbAAbA
                     mmmmmmm^^^
    - Right part matches, left part is remembered, found a match!

The one tricky skip by 8 here generalizes: if we have a period of p,
then the CFT says we can ensure the cut has the inequality property
for lengths 1 through p-1, and jumping by p would line up the
matching characters and mismatched character one period earlier.
Inductively, this proves that we can skip by the number of characters
matched in the right half, plus 1, just as in the original algorithm.

To make it explicit, the memory is set whenever the entire right part
is matched and is then used as a starting point in the next alignment.
In such a case, the alignment jumps forward one period, and the right
half matches all except possibly the last period. Additionally,
if we cut so that the left part has a length strictly less than the
period (we always can!), then we can know that the left part already
matches. The memory is reset to 0 whenever there is a mismatch in the
right part.
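The memory bookkeeping can be sketched as follows. This is an
illustrative simplification rather than the code in fastsearch.h; the
function name and the returned list of attempted alignments are
additions made for this example, and the caller is assumed to supply
the true period together with a critical cut shorter than the period.

```python
def find_periodic(needle, haystack, cut, period):
    """Two-Way search with memory.

    Assumes (not checked) that `period` is the period of the needle
    and that `cut` is a critical factorization with cut < period.
    Returns (index of first match or -1, alignments tried).
    """
    m, n = len(needle), len(haystack)
    pos, memory = 0, 0
    tried = []
    while pos + m <= n:
        tried.append(pos)
        # Right part: skip whatever the memory already covers.
        i = max(cut, memory)
        while i < m and needle[i] == haystack[pos + i]:
            i += 1
        if i < m:
            pos += i - cut + 1
            memory = 0  # a right-part mismatch resets the memory
            continue
        # Left part: only the characters not covered by the memory.
        j = cut - 1
        while j >= memory and needle[j] == haystack[pos + j]:
            j -= 1
        if j < memory:
            return pos, tried
        pos += period
        memory = m - period  # this many leading characters now match
    return -1, tried


print(find_periodic("AAbAAbAAbA", "bbbAbbAAbAAbAAbbbAAbAAbAAbAA", 2, 3))
# prints (17, [0, 3, 6, 14, 17])
```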

To prove linearity for the periodic case, note that if a right-part
character mismatches, then we advance forward 1 unit per comparison.
On the other hand, if the entire right part matches, then the skipping
forward by one period "defers" some of the comparisons to the next
alignment, where they will then be spent at the usual rate of
one comparison per step forward. Even if left-half comparisons
are always "wasted", they constitute less than half of all
comparisons, so the average rate is certainly at least 1 move forward
per 2 comparisons.


-------- When to choose the periodic algorithm --------

The periodic algorithm is always valid but has an overhead of one
more "memory" register and some memory computation steps, so the
non-periodic/long-period algorithm described first -- skipping by
max(len(left_part), len(right_part)) + 1 rather than the period --
should be preferred when possible.

Interestingly, the long-period algorithm does not require an exact
computation of the period; it works even with some long-period, but
undeniably "periodic" needles:

    Cut: AbcdefAbc == Abcde + fAbc

This cut gives these inequalities:

                 e != f
                de != fA
               cde != fAb
              bcde != fAbc
             Abcde != fAbc?
    The first failure is a period long, per the CFT:
            ?Abcde == fAbc??

A sufficient condition for using the long-period algorithm is having
the period of the needle be greater than
max(len(left_part), len(right_part)). This way, after choosing a good
split, we get all of the max(len(left_part), len(right_part))
inequalities around the cut that were required in the long-period
version of the algorithm.

With all of this in mind, here's how we choose:

    (1) Choose a "critical factorization" of the needle -- a cut
        where we have period minus 1 inequalities in a row.
        More specifically, choose a cut so that the left_part
        is less than one period long.
    (2) Determine the period P_r of the right_part.
    (3) Check if the left part is just an extension of the pattern of
        the right part, so that the whole needle has period P_r.
        Explicitly, check if
            needle[0:cut] == needle[0+P_r:cut+P_r]
        If so, we use the periodic algorithm. If not equal, we use the
        long-period algorithm.

Note that if equality holds in (3), then the period of the whole
string is P_r. On the other hand, suppose equality does not hold.
The period of the needle is then strictly greater than P_r. Here's
a general fact:

    If p is a substring of s and p has period r, then the period
    of s is either equal to r or greater than len(p).

We know that needle_period != P_r,
and therefore needle_period > len(right_part).
Additionally, we'll choose the cut (see below)
so that len(left_part) < needle_period.

Thus, in the case where equality does not hold, we have that
needle_period >= max(len(left_part), len(right_part)) + 1,
so the long-period algorithm works, but otherwise, we know the period
of the needle.

Note that this decision process doesn't always require an exact
computation of the period -- we can get away with only computing P_r!
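Steps (2) and (3) might be sketched like this, assuming the cut from
step (1) is already known. The names period and choose_algorithm are
hypothetical, and the period is found here by brute force rather than
in linear time.

```python
def period(s):
    """Smallest p >= 1 with s[i] == s[i + p] wherever both indices exist."""
    return next(p for p in range(1, len(s) + 1)
                if all(s[i] == s[i + p] for i in range(len(s) - p)))


def choose_algorithm(needle, cut):
    """Return ("periodic", P_r) or ("long-period", None)."""
    p_r = period(needle[cut:])  # step (2): period of the right part
    # Step (3): does the left part extend the pattern of the right part?
    if needle[:cut] == needle[p_r:cut + p_r]:
        return "periodic", p_r
    return "long-period", None


print(choose_algorithm("AAbAAbAAbA", 2))  # prints ('periodic', 3)
print(choose_algorithm("AbcdefAbc", 5))   # prints ('long-period', None)
```

For "AbcdefAbc" the period of the whole needle (6) is never computed;
only the period of the right part "fAbc" is needed.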


-------- Computing the cut --------

Our remaining tasks are now to compute a cut of the needle with as
many inequalities as possible, ensuring that cut < needle_period.
Meanwhile, we must also compute the period P_r of the right_part.

The computation is relatively simple, essentially doing this:

    suffix1 = max(needle[i:] for i in range(len(needle)))
    suffix2 = ... # the same as above, but invert the alphabet
    cut1 = len(needle) - len(suffix1)
    cut2 = len(needle) - len(suffix2)
    cut = max(cut1, cut2) # the later cut

For cut2, "invert the alphabet" is different from saying min(...),
since in lexicographic order, we still put "py" < "python", even
if the alphabet is inverted. Computing these, along with the method
of computing the period of the right half, is easiest to read directly
from the source code in fastsearch.h, in which these are computed
in linear time.
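A runnable but deliberately naive version of the computation above
could look like the following. It takes quadratic time, unlike the
linear-time code in fastsearch.h, assumes the needle is a str, and
uses made-up function names. Negating the code points inverts the
alphabet while still letting a proper prefix sort first.

```python
def max_suffix_start(needle, invert=False):
    """Start index of the lexicographically largest suffix.

    With invert=True the alphabet order is reversed, but a proper
    prefix still sorts before the longer string.
    """
    sign = -1 if invert else 1

    def key(i):
        # Lists compare element-wise, and a prefix compares as smaller.
        return [sign * ord(c) for c in needle[i:]]

    return max(range(len(needle)), key=key)


def critical_cut(needle):
    """The later of the two maximal-suffix cuts."""
    return max(max_suffix_start(needle),
               max_suffix_start(needle, invert=True))


print(critical_cut("AAbAAbAAbA"), critical_cut("AbcdefAbc"))  # prints 2 5
```

Both results agree with the cuts used in the examples earlier:
AA + bAAbAAbA and Abcde + fAbc.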

Crochemore & Perrin's Theorem 3.1 gives that "cut" above is a
critical factorization less than the period. A very brief sketch
of their proof goes something like this (this is far from complete):

    * If this cut splits the needle as some
      needle == (a + w) + (w + b), meaning there's a bad equality
      w == w, it's impossible for w + b to be bigger than both
      b and w + w + b, so this can't happen. We thus have all of
      the inequalities with no question marks.
    * By maximality, the right part is not a substring of the left
      part. Thus, we have all of the inequalities involving no
      left-side question marks.
    * If you have all of the inequalities without right-side question
      marks, we have a critical factorization.
    * If one such inequality fails, then there's a smaller period,
      but the factorization is nonetheless critical. Here's where
      you need the redundancy coming from computing both cuts and
      choosing the later one.


-------- Some more Bells and Whistles --------

Beyond Crochemore & Perrin's original algorithm, we can use a couple
more tricks for speed in fastsearch.h:

    1. Even though C&P has a best-case O(n/m) time, this doesn't occur
       very often, so we add a Boyer-Moore bad character table to
       achieve sublinear time in more cases.

    2. The prework of computing the cut/period is expensive per
       needle character, so we shouldn't do it if it won't pay off.
       For this reason, if the needle and haystack are long enough,
       only automatically start with two-way if the needle's length
       is a small percentage of the length of the haystack.

    3. In cases where the needle and haystack are large but the needle
       makes up a significant percentage of the length of the
       haystack, don't pay the expensive two-way preprocessing cost
       if you don't need to. Instead, keep track of how many
       character comparisons are equal, and if that exceeds
       O(len(needle)), then pay that cost, since the simpler algorithm
       isn't doing very well.
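To give a feel for what a bad character table buys, here is a generic
Horspool-style search in Python. It is a stand-alone textbook
technique shown for illustration only; the table layout and the way
it is combined with Two-Way in fastsearch.h may well differ, and the
function name is made up.

```python
def horspool_find(needle, haystack):
    """Search using only a bad-character shift table (Horspool style)."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # For each character of needle[:-1], the distance from its last
    # occurrence to the end of the needle. Later occurrences overwrite
    # earlier ones, so the smallest safe shift is kept.
    shift = {c: m - 1 - i for i, c in enumerate(needle[:-1])}
    pos = 0
    while pos + m <= n:
        if haystack[pos:pos + m] == needle:
            return pos
        # Shift based on the haystack character under the needle's
        # last position; characters absent from the table allow a
        # full-length jump.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1


print(horspool_find("abcdEFGH", "." * 40 + "abcdEFGH"))  # prints 40
```

In this example every window before the match ends in a character
that does not occur in the needle, so the search advances 8 positions
at a time.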