You could try something like this:
import difflib
possibilities = ['Summerdalerise', 'Winterstreamrise']
line = 'I went up to Winterstreamrose.'
newWords = []
for word in line.split():
    result = difflib.get_close_matches(word, possibilities, n=1)
    newWords.append(result[0] if result else word)
result = ' '.join(newWords)
print(result)
Output:
I went up to Winterstreamrise
Explanation:
- The docs show a first argument named word, and there is no suggestion that get_close_matches() has any awareness of sub-words within this argument; rather, it reports on the closeness of a match between this word, taken atomically, and the list of possibilities supplied as the second argument.
- We can add awareness of the words within line by splitting it into a list of such words, which we iterate over, calling get_close_matches() for each word separately and modifying the word in our result only if there is a match.
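One caveat worth noting: split() keeps punctuation attached to its word, so the last token above is really 'Winterstreamrose.' with the trailing period, which slightly lowers the ratio. A sketch that matches on the bare word while keeping the punctuation (stripping with string.punctuation is one simple approach, assuming ordinary sentence punctuation):

```python
import difflib
import string

possibilities = ['Summerdalerise', 'Winterstreamrise']
line = 'I went up to Winterstreamrose.'

new_words = []
for word in line.split():
    core = word.strip(string.punctuation)  # match on the bare word
    result = difflib.get_close_matches(core, possibilities, n=1)
    # substitute only the core so surrounding punctuation survives
    new_words.append(word.replace(core, result[0]) if result else word)

print(' '.join(new_words))  # → I went up to Winterstreamrise.
```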
Using difflib.get_close_matches to replace word in string - Python - Stack Overflow
python 2.7 - difflib.get_close_matches GET SCORE - Stack Overflow
string - How does the python difflib.get_close_matches() function work? - Stack Overflow
python - Is there an alternative to `difflib.get_close_matches()` that returns indexes (list positions) instead of a str list? - Stack Overflow
I recently started programming and I stumbled upon the difflib.get_close_matches function when I tried to come up with a way to give a close match upon entering an invalid statement. Now I have implemented the function in my program, but I still don't fully understand how exactly this thing works. I only know that it looks for the most adjacent correct letters in all words and determines the close matches that way. But how does it compare the letters with one another? Does it transform every word into a list and compare it with the input?
I found that difflib.get_close_matches is the simplest way to do fuzzy matching of strings. But there are a few other, more advanced libraries, like fuzzywuzzy, as you mentioned in the comments.
But if you want to use difflib, you can use difflib.SequenceMatcher to get the score as follows:
import difflib
my_str = 'apple'
str_list = ['ape' , 'fjsdf', 'aerewtg', 'dgyow', 'paepd']
best_match = difflib.get_close_matches(my_str,str_list,1)[0]
score = difflib.SequenceMatcher(None, my_str, best_match).ratio()
In this example, the best match between 'apple' and the list is 'ape' and the score is 0.75.
You can also loop through the list and compute all the scores to check:
for word in str_list:
    print "score for: " + my_str + " vs. " + word + " = " + str(difflib.SequenceMatcher(None, my_str, word).ratio())
For this example, you get the following:
score for: apple vs. ape = 0.75
score for: apple vs. fjsdf = 0.0
score for: apple vs. aerewtg = 0.333333333333
score for: apple vs. dgyow = 0.0
score for: apple vs. paepd = 0.4
Documentation for difflib can be found here: https://docs.python.org/2/library/difflib.html
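The loop above uses the Python 2 print statement; for Python 3 the same comparison loop looks like this (the float reprs may show more digits, since the values are not rounded):

```python
import difflib

my_str = 'apple'
str_list = ['ape', 'fjsdf', 'aerewtg', 'dgyow', 'paepd']

# print the similarity ratio of my_str against each candidate
for word in str_list:
    ratio = difflib.SequenceMatcher(None, my_str, word).ratio()
    print(f"score for: {my_str} vs. {word} = {ratio}")
```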
To answer the question, the usual route would be to obtain the comparative score for a match returned by get_close_matches() individually in this manner:
match_ratio = difflib.SequenceMatcher(None, 'aple', 'apple').ratio()
Here's a way that increases speed in my case by about 10% ...
I'm using get_close_matches() for spellchecking: it runs SequenceMatcher() under the hood but strips the scores, returning just a list of matching strings. Normally.
But with a small change in Lib/difflib.py (currently around line 736), the return can be a dictionary with scores as values, so there is no need to run SequenceMatcher again on each list item to obtain its score ratio. In the examples I've shortened the output float values for clarity (e.g. 0.8888888888888888 to 0.889). The argument n=7 limits the return to the 7 highest-scoring items if there are more than 7, which matters when candidates are many.
The current list-only return
In this example the result would normally be like ['apple', 'staple', 'able', 'lapel']
... at the default cutoff of .6 if omitted (as in Ben's answer; no judgement).
The change
in difflib.py is simple (the trailing comment shows the original line):
return {v: k for (k, v) in result} # hack to return dict with scores instead of list, original was ... [x for score, x in result]
New dictionary return
includes scores like {'apple': 0.889, 'staple': 0.8, 'able': 0.75, 'lapel': 0.667}
>>> to_match = 'aple'
>>> candidates = ['lapel', 'staple', 'zoo', 'able', 'apple', 'appealing']
Increasing minimum score cutoff/threshold from .4 to .8:
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.4)
{'apple': 0.889, 'staple': 0.8, 'able': 0.75, 'lapel': 0.667, 'appealing': 0.461}
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.7)
{'apple': 0.889, 'staple': 0.8, 'able': 0.75}
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.8)
{'apple': 0.889, 'staple': 0.8}
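If editing Lib/difflib.py is not an option, the same dict-with-scores return can be obtained by adapting the stdlib implementation into a local helper. A sketch (the function name get_close_matches_scored is my own):

```python
from difflib import SequenceMatcher
from heapq import nlargest

def get_close_matches_scored(word, possibilities, n=3, cutoff=0.6):
    """Like get_close_matches, but return a {match: score} dict,
    computing each ratio only once."""
    s = SequenceMatcher()
    s.set_seq2(word)
    result = []
    for x in possibilities:
        s.set_seq1(x)
        # same three-stage cutoff filter as the stdlib function
        if (s.real_quick_ratio() >= cutoff and
                s.quick_ratio() >= cutoff and
                s.ratio() >= cutoff):
            result.append((s.ratio(), x))
    # best scorers first; dict keys keep that descending order
    return {x: score for score, x in nlargest(n, result)}

scored = get_close_matches_scored(
    'aple', ['lapel', 'staple', 'zoo', 'able', 'apple', 'appealing'],
    n=7, cutoff=.4)
```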
Well, there is this part in the docs explaining your issue:
This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.
To get the results you are expecting, you could use the Levenshtein distance.
But for comparing IPs I would suggest using integer comparison:
>>> parts = [int(s) for s in '198.124.252.130'.split('.')]
>>> parts2 = [int(s) for s in '198.124.252.101'.split('.')]
>>> from operator import sub
>>> diff = sum(d * 10**(3-pos) for pos,d in enumerate(map(sub, parts, parts2)))
>>> diff
29
You can use this style to create a compare function:
from functools import partial
from operator import sub

def compare_ips(base, ip1, ip2):
    base = [int(s) for s in base.split('.')]
    parts1 = (int(s) for s in ip1.split('.'))
    parts2 = (int(s) for s in ip2.split('.'))
    test1 = sum(abs(d * 10**(3-pos)) for pos, d in enumerate(map(sub, base, parts1)))
    test2 = sum(abs(d * 10**(3-pos)) for pos, d in enumerate(map(sub, base, parts2)))
    return cmp(test1, test2)
base = '198.124.252.101'
test_list = ['198.124.252.102','134.55.41.41','134.55.219.121',
'134.55.219.137','134.55.220.45', '198.124.252.130']
sorted(test_list, cmp=partial(compare_ips, base))
# yields:
# ['198.124.252.102', '198.124.252.130', '134.55.219.121', '134.55.219.137',
# '134.55.220.45', '134.55.41.41']
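Python 3 removed both the cmp() builtin and sorted()'s cmp= parameter, so the same ordering can be expressed with a key function instead. A sketch of the equivalent logic (functools.cmp_to_key could also adapt a comparison function, but the cmp() call inside compare_ips would still need replacing):

```python
def ip_distance(base, ip):
    # weighted octet distance, mirroring the compare_ips logic above
    base_parts = [int(s) for s in base.split('.')]
    ip_parts = [int(s) for s in ip.split('.')]
    return sum(abs(b - p) * 10**(3 - pos)
               for pos, (b, p) in enumerate(zip(base_parts, ip_parts)))

base = '198.124.252.101'
test_list = ['198.124.252.102', '134.55.41.41', '134.55.219.121',
             '134.55.219.137', '134.55.220.45', '198.124.252.130']
result = sorted(test_list, key=lambda ip: ip_distance(base, ip))
```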
A hint from difflib's docstring:
SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching". The basic idea is to find the longest contiguous matching subsequence that contains no "junk" elements (R-O doesn't address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people.
Regarding your requirement to compare IPs based on custom logic: you should first validate that the string is a proper IP, and then writing the comparison logic with simple integer arithmetic is an easy task. A library is not needed at all.
I took the source code for get_close_matches and modified it to return the indexes instead of the string values.
# mydifflib.py
from difflib import SequenceMatcher
from heapq import nlargest as _nlargest

def get_close_matches_indexes(word, possibilities, n=3, cutoff=0.6):
    """Use SequenceMatcher to return a list of the indexes of the best
    "good enough" matches. word is a sequence for which close matches
    are desired (typically a string).

    possibilities is a list of sequences against which to match word
    (typically a list of strings).

    Optional arg n (default 3) is the maximum number of close matches to
    return. n must be > 0.

    Optional arg cutoff (default 0.6) is a float in [0, 1]. Possibilities
    that don't score at least that similar to word are ignored.
    """
    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for idx, x in enumerate(possibilities):
        s.set_seq1(x)
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), idx))

    # Move the best scorers to head of list
    result = _nlargest(n, result)

    # Strip scores for the best n matches
    return [x for score, x in result]
Usage
>>> from mydifflib import get_close_matches_indexes
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> get_close_matches_indexes('hello', words)
[0, 1, 6]
Now I can relate these indexes to the associated data of each string without having to search back through the strings.
Not an exact answer to your question, but I was trying to find the index of a simpler single match, and the syntax is:
match_string = difflib.get_close_matches(appx_name_str,names_list,n=1,cutoff=0.1)[0] # get the most similar string
match_index = names_list.index(match_string) # index method on list of strings
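One caveat with list.index() is that it returns only the first occurrence, so duplicate strings in names_list all map to the same index. Scoring each position directly avoids the extra lookup entirely; a sketch (close_match_index is my own helper name):

```python
import difflib

def close_match_index(query, names, cutoff=0.6):
    """Index of the best-scoring name, or None if nothing clears the cutoff."""
    scores = [difflib.SequenceMatcher(None, query, name).ratio()
              for name in names]
    best = max(range(len(names)), key=scores.__getitem__)
    return best if scores[best] >= cutoff else None

names = ['hello', 'Hallo', 'hi', 'house', 'hello']
idx = close_match_index('helo', names)  # → 0, even with the duplicate at index 4
```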
I came across the same question and found that difflib.get_close_matches builds on the approach called "Gestalt pattern matching" described by Ratcliff and Obershelp (link below).
The method difflib.get_close_matches is based on the class SequenceMatcher, whose docstring in the source code reads: "SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching". The basic idea is to find the longest contiguous matching subsequence that contains no "junk" elements (R-O doesn't address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people."
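The recursive longest-match behaviour described in that docstring can be observed directly with SequenceMatcher.get_matching_blocks():

```python
import difflib

s = difflib.SequenceMatcher(None, 'apple', 'ape')
blocks = s.get_matching_blocks()
# longest match 'ap', then 'e' found by the recursion on the right,
# plus a terminating zero-length block
print(blocks)
print(s.ratio())  # 3 matched characters: 2 * 3 / (5 + 3) = 0.75
```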
About the cutoff: this tells you how close a match you want. If it is 1, the candidate needs to be exactly the same word; as you lower it, the matching becomes more relaxed. For instance, if you choose 0 it will always return the most "similar" word even when nothing is actually similar, which would not make much sense in most cases. The default of 0.6 tends to give meaningful results, but it depends on your particular application; you need to test what works for you based on your vocabulary and specific scenario.
PATTERN MATCHING: THE GESTALT APPROACH http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/DDJ/1988/8807/8807c/8807c.htm
Hope this helps you to understand "difflib.get_close_matches" better.
From the documentation:
Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don't score at least that similar to word are ignored.
Trying the example from the documentation:
In [11]: import difflib
In [12]: difflib.get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
Out[12]: ['apple', 'ape']
In [13]: difflib.get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'], cutoff=0.1)
Out[13]: ['apple', 'ape', 'puppy']
In [14]: difflib.get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'], cutoff=0.9)
Out[14]: []
Details about the algorithm are noted in the article "Pattern Matching: The Gestalt Approach".