It's a combination of Boyer-Moore and Horspool.
You can view the C code here:
Fast search/count implementation, based on a mix between Boyer-Moore and Horspool, with a few more bells and whistles on the top. For some more background, see: https://web.archive.org/web/20201107074620/http://effbot.org/zone/stringlib.htm.
From the link above:
Answer from arshajii on Stack Overflow:
When designing the new algorithm, I used the following constraints:
- should be faster than the current brute-force algorithm for all test cases (based on real-life code), including Jim Hugunin’s worst-case test
- small setup overhead; no dynamic allocation in the fast path (O(m) for speed, O(1) for storage)
- sublinear search behaviour in good cases (O(n/m))
- no worse than the current algorithm in worst case (O(nm))
- should work well for both 8-bit strings and 16-bit or 32-bit Unicode strings (no O(σ) dependencies)
- many real-life searches should be good, very few should be worst case
- reasonably simple implementation
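To make the "mix of Boyer-Moore and Horspool" concrete, here is a pure-Python sketch of the Horspool bad-character skip idea. This is an illustration only, not CPython's actual C fastsearch code; the function name `horspool_find` is made up for this example.

```python
def horspool_find(s, p):
    """Return the index of the first occurrence of p in s, or -1.

    Illustrative Horspool sketch: when the window mismatches, skip
    ahead based on the character aligned with the pattern's last
    position, instead of shifting by one as brute force does.
    """
    n, m = len(s), len(p)
    if m == 0:
        return 0
    # Shift table over p[:-1]: distance from each character's last
    # occurrence to the end of the pattern. Characters absent from
    # the pattern allow a full shift of m.
    skip = {c: m - i - 1 for i, c in enumerate(p[:-1])}
    i = 0
    while i <= n - m:
        if s[i:i + m] == p:
            return i
        i += skip.get(s[i + m - 1], m)
    return -1
```

For average inputs the skip table lets the search jump up to `m` characters at a time, which is where the sublinear O(n/m) good-case behaviour comes from.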
Since 2021, CPython uses Crochemore and Perrin's Two-Way algorithm for larger n.
From the source code:
If the strings are long enough, use Crochemore and Perrin's Two-Way algorithm, which has worst-case O(n) runtime and best-case O(n/k). Also compute a table of shifts to achieve O(n/k) in more cases, and often (data dependent) deduce larger shifts than pure C&P can deduce. See stringlib_find_two_way_notes.txt in this folder for a detailed explanation.
See https://github.com/python/cpython/pull/22904
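The classic worst case for the older algorithm is a long run of one character searched with a needle that almost matches at every position. A quick sketch (sizes chosen arbitrarily for illustration) shows that on Python 3.10+ the Two-Way path handles this in roughly linear time; on older interpreters the same call can take dramatically longer:

```python
import time

# Pathological input: the needle nearly matches at every offset,
# which drives quadratic behaviour in naive/Horspool-style search.
haystack = "A" * 1_000_000
needle = "A" * 10_000 + "B"   # never actually matches

t0 = time.perf_counter()
result = haystack.find(needle)
elapsed = time.perf_counter() - t0

print(f"find returned {result} in {elapsed:.3f}s")
```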
Python String Addition Time Complexity
I am trying to determine the time complexity of the encode function below. It goes over every string in the input list, so that's O(n), where n is the length of the input list. But in each iteration, we add to the encodedStr. Since python creates a new string every time the add operation is performed, do we need to take the length of encodedStr into account for the time complexity of the function?
from typing import List

class Codec:
    def encode(self, strs: List[str]) -> str:
        # Encodes a list of strings to a single string.
        encodedStr = ''
        for s in strs:
            encodedStr += str(len(s)) + '#' + s
        return encodedStr

The complexity of in depends entirely on what L is. e in L will become L.__contains__(e).
See this time complexity document for the complexity of several built-in types.
Here is the summary for in:
- list - Average: O(n)
- set/dict - Average: O(1), Worst: O(n)
The O(n) worst case for sets and dicts is very uncommon, but it can happen if __hash__ is implemented poorly. This only happens if everything in your set has the same hash value.
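As an illustration of that degenerate case (the `BadHash` class here is hypothetical, invented for this example), a constant `__hash__` forces every element into the same bucket, so set lookups have to walk one long collision chain:

```python
class BadHash:
    """Every instance hashes to the same value, so a set of them
    degenerates into a single collision chain and membership tests
    hit the O(n) worst case."""
    def __init__(self, x):
        self.x = x
    def __hash__(self):
        return 42  # constant hash: all instances collide
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.x == other.x

items = {BadHash(i) for i in range(1000)}
# Lookups still give correct answers, just slowly (linear scans).
print(BadHash(500) in items)   # found after walking the chain
print(BadHash(2000) in items)  # absent: walks the whole chain
```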
It depends entirely on the type of the container. Hashing containers (dict, set) use the hash and are essentially O(1). Typical sequences (list, tuple) are implemented as you guess and are O(n). Trees would be average O(log n). And so on. Each of these types would have an appropriate __contains__ method with its big-O characteristics.
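Since `in` is just a method call, any class can supply its own `__contains__` with whatever complexity it likes. A toy example (the `EvenNumbers` class is made up here to show the dispatch):

```python
class EvenNumbers:
    """A conceptually infinite 'container' whose membership test
    is O(1): x in EvenNumbers() dispatches to __contains__."""
    def __contains__(self, x):
        return isinstance(x, int) and x % 2 == 0

evens = EvenNumbers()
print(4 in evens)   # constant-time answer, no storage at all
print(7 in evens)
```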
The time complexity is O(N) on average, O(NM) worst case (N being the length of the longer string, M, the shorter string you search for). As of Python 3.10, heuristics are used to lower the worst-case scenario to O(N + M) by switching algorithms.
The same algorithm is used for str.index(), str.find(), str.__contains__() (the in operator) and str.replace(); it is a simplification of Boyer–Moore with ideas taken from the Boyer–Moore–Horspool and Sunday algorithms.
See the original stringlib discussion post, as well as the fastsearch.h source code; until Python 3.10, the base algorithm had not changed since its introduction in Python 2.5 (apart from some low-level optimisations and corner-case fixes).
The post includes a Python-code outline of the algorithm:
def find(s, p):
    # find first occurrence of p in s
    n = len(s)
    m = len(p)
    skip = delta1(p)[p[m-1]]
    i = 0
    while i <= n-m:
        if s[i+m-1] == p[m-1]:  # (boyer-moore)
            # potential match
            if s[i:i+m-1] == p[:m-1]:
                return i
            if s[i+m] not in p:
                i = i + m + 1  # (sunday)
            else:
                i = i + skip  # (horspool)
        else:
            # skip
            if s[i+m] not in p:
                i = i + m + 1  # (sunday)
            else:
                i = i + 1
    return -1  # not found
as well as speed comparisons.
In Python 3.10, the algorithm was updated to use an enhanced version of Crochemore and Perrin's Two-Way string searching algorithm for larger problems (with p and s longer than 100 and 2100 characters, respectively, and s at least 6 times as long as p), in response to a pathological edge case that was reported. The commit adding this change included a write-up on how the algorithm works.
The Two-Way algorithm has a worst-case time complexity of O(N + M), where O(M) is a cost paid up front to build a shift table from the search needle p. Once you have that table, the algorithm has a best-case performance of O(N/M).
In Python 3.4.2, it looks like they are resorting to the same function, but there may be a difference in timing nevertheless. For example, s.find first is required to look up the find method of the string and such.
The algorithm used is a mix between Boyer–Moore and Horspool.
Let's say I have an integer as a string value and I want to see if it is a certain string int value. For example:

my_int = "3"

# using `in` keyword
if my_int in "123789":
    print("accepted")
else:
    print("not accepted")

# using a set
if my_int in {'1', '2', '3', '7', '8', '9'}:
    print("accepted")
else:
    print("not accepted")

In these cases, is there a difference in time complexity? Using the in keyword seems like it might search through each character in some sort of loop, but maybe internally, python does something else to improve lookup?
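Not an authoritative benchmark, but one way to probe the question empirically (timings will vary by machine and interpreter; the variable names are chosen for this sketch):

```python
import timeit

my_int = "3"
digits_str = "123789"
digits_set = {'1', '2', '3', '7', '8', '9'}

in_str = my_int in digits_str   # substring scan: O(n*m) worst case
in_set = my_int in digits_set   # hash lookup: O(1) average

# For a six-element container both are effectively instant; the
# asymptotic difference only shows up as the container grows.
t_str = timeit.timeit('my_int in digits_str', globals=globals(), number=100_000)
t_set = timeit.timeit('my_int in digits_set', globals=globals(), number=100_000)
print(f"substring: {t_str:.4f}s  set: {t_set:.4f}s")
```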
Recently I was thinking about interview questions I got as an undergrad:
Things like "reverse a string" and "check if a string is a palindrome".
I did most of these in C++ with a loop and scrolling through the index using logic.
When I learned Python, I realized that I could "reverse a string" by simply going:
return mystring[::-1]
Likewise with "check if it is a palindrome" by doing:
return mystring == mystring[::-1]
The problem now is that, I don't know what kinda complexity it is.
From my point of view, it is constant, so O(1). But I am guessing that that is too good to be true as the string splicing is doing something behind the scenes.
Can anyone help me clarify?
The python page on time-complexity shows that slicing lists has a time-complexity of O(k), where "k" is the length of the slice. That's for lists, not strings, but the complexity can't be O(1) for strings since the slicing must handle more characters as the size is increased. At a guess, the complexity of slicing strings would also be O(k). We can write a little bit of code to test that guess:
import time

StartSize = 2097152
size = StartSize
for _ in range(10):
    # create string of size "size"
    s = '*' * size
    # now time reverse slice
    start = time.time()
    r = s[::-1]
    delta = time.time() - start
    print(f'Size {size:9d}, time={delta:.3f}')
    # double size of the string
    size *= 2
This uses a simple method of timing. Other tools exist, but this is simple. When run I get:
$ python3 test.py
Size 2097152, time=0.006
Size 4194304, time=0.013
Size 8388608, time=0.024
Size 16777216, time=0.050
Size 33554432, time=0.098
Size 67108864, time=0.190
Size 134217728, time=0.401
Size 268435456, time=0.808
Size 536870912, time=1.610
Size 1073741824, time=3.192
which shows the time doubles when doubling the size of the string for each reverse slice. So O(n) (k == n for whole-string slicing).
How difficult an algorithm is to write and how expensive it is to run are two separate things. Creating a reversed string with the slice shorthand still requires O(n) space and O(n) time. Keep in mind that, in most cases, creating a reversed copy isn't necessary: you can just iterate from the end toward the start, which is essentially what Python's reversed() function does.
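For instance, a palindrome check can compare characters from both ends in place, avoiding the reversed copy entirely; a small sketch (the function name is made up for this example):

```python
def is_palindrome(s):
    """Compare s from both ends inward: O(n) time, O(1) extra
    space, versus O(n) extra space for s == s[::-1]."""
    i, j = 0, len(s) - 1
    while i < j:
        if s[i] != s[j]:
            return False
        i += 1
        j -= 1
    return True

# The two-pointer version agrees with the slice one-liner,
# it just skips the intermediate reversed string.
print(is_palindrome("racecar"))
print(is_palindrome("python"))
```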