How to Compare a List of Names in Python Using the Jaro-Winkler Similarity Metric?

Published: August 29, 2024

Tags: Python

Introduction

When working with datasets that contain names, whether they be names of cities, people, or other entities, a common issue arises: slight variations in spelling or formatting. For example, you may encounter names like "John Doe" and "John A. Doe" or "New York" and "New York City" when merging multiple datasets. These minor differences can pose significant challenges in ensuring accurate data matching and integration.

This is where string similarity metrics come into play. They let you compare and match names that are not character-for-character identical but are close enough in spelling to refer to the same entity. One such metric, the Jaro-Winkler similarity, scores the similarity of two strings on a scale from 0 to 1, which makes it well suited to comparing names with slight differences.

In this article, we'll explore how to use the Jaro-Winkler similarity metric in Python to compare a list of names. We will leverage the textdistance library, which provides an easy-to-use implementation of this metric. Whether you're dealing with slight misspellings, different name formats, or other small variations, this approach can help you identify matching records with greater accuracy.

What is the Jaro-Winkler Similarity Metric?

The Jaro-Winkler similarity is a string metric that scores how alike two strings are, from 0 (nothing in common) to 1 (identical). It extends the Jaro similarity, which is based on the number of matching characters and the number of transpositions between the two strings, and it is particularly useful for short strings such as names. The key idea behind the Winkler extension is a bonus for strings that share a common prefix, conventionally counted up to four characters, so two names that start the same way score higher than two names that differ at the beginning.
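
For reference, here is the usual textbook form of the two scores. The symbols are notation chosen for this sketch rather than names from any library: m is the number of matching characters (a character matches only if its counterpart lies within floor(max(|s1|, |s2|) / 2) - 1 positions), t is half the number of matched characters that appear in a different order, l is the length of the common prefix capped at 4, and p is the scaling factor, conventionally 0.1.

```latex
\[
\mathrm{sim}_{J}(s_1, s_2) =
\begin{cases}
0 & \text{if } m = 0,\\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise,}
\end{cases}
\qquad
\mathrm{sim}_{JW}(s_1, s_2) = \mathrm{sim}_{J} + \ell \, p \,\bigl(1 - \mathrm{sim}_{J}\bigr)
\]

% Worked example: MARTHA vs MARHTA
% m = 6, t = 1 (the pair T/H is swapped), \ell = 3 ("MAR"), p = 0.1
\[
\mathrm{sim}_{J} = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6}\right) = \frac{17}{18} \approx 0.9444,
\qquad
\mathrm{sim}_{JW} = \frac{17}{18} + 3 \times 0.1 \times \frac{1}{18} \approx 0.9611
\]
```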

Using the textdistance Python library

Installing the Required Library

Before we dive into the code, you'll need to install the textdistance library, which provides a wide range of string similarity metrics, including Jaro-Winkler.

You can install it using pip:

pip install textdistance

Creating a List of Names

Let's start by creating a list of names that we want to compare. We'll use these names to demonstrate how to calculate the similarity between each pair using the Jaro-Winkler metric.

import pandas as pd
import textdistance

# Create a pandas DataFrame with some names
names = ['Martha', 'Marhta', 'Mark', 'Marta', 'Mathew', 'Matthew']
df = pd.DataFrame(names, columns=['Name'])

Calculating Jaro-Winkler Similarity

Now that we have our list of names, we can calculate the Jaro-Winkler similarity between each pair. The textdistance library makes this process straightforward.

We'll create a function to calculate the similarity between a given name and all other names in the list, and then apply this function across our DataFrame to generate a similarity matrix.

# Create a function to apply the Jaro-Winkler similarity to each pair of names
def calculate_similarity(row, df):
    return df['Name'].apply(lambda x: textdistance.jaro_winkler(row['Name'], x))

# Apply the function to the DataFrame
similarity_matrix = df.apply(lambda row: calculate_similarity(row, df), axis=1)

# Set the index and columns names for better readability
similarity_matrix.index = df['Name']
similarity_matrix.columns = df['Name']

print(similarity_matrix)

The output of the above code is a similarity matrix where each cell represents the Jaro-Winkler similarity score between two names. The values range from 0 to 1, where 1 indicates an exact match and values closer to 0 indicate less similarity.

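Here’s an example of what the output might look like. The scores shown follow the standard definition (scaling factor 0.1, common prefix capped at four characters). Exact numbers can vary between implementations: some apply the prefix bonus only when the Jaro score exceeds 0.7, which lowers the scores of the least similar pairs.

           Martha    Marhta      Mark     Marta    Mathew   Matthew
Martha   1.000000  0.961111  0.825000  0.966667  0.822222  0.796825
Marhta   0.961111  1.000000  0.825000  0.961111  0.755556  0.730159
Mark     0.825000  0.825000  1.000000  0.848333  0.688889  0.676190
Marta    0.966667  0.961111  0.848333  1.000000  0.760000  0.740952
Mathew   0.822222  0.755556  0.688889  0.760000  1.000000  0.966667
Matthew  0.796825  0.730159  0.676190  0.740952  0.966667  1.000000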

This similarity matrix can be used to identify potential matches or duplicates in your dataset. For example, you might consider any pair of names with a similarity score above a certain threshold (e.g., 0.9) as a likely match. This approach can be particularly useful in tasks such as data cleaning, record linkage, and merging datasets where consistent naming conventions are not guaranteed.
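
The thresholding idea can be sketched in a few lines of plain Python. The helper name pairs_above_threshold and the stand_in_scorer function are made up for this example. To keep the sketch runnable without installing anything, the stand-in uses difflib from the standard library, which is not Jaro-Winkler; in practice you would pass textdistance.jaro_winkler as the scorer.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairs_above_threshold(names, scorer, threshold=0.9):
    """Return (name_a, name_b, score) for every unordered pair scoring >= threshold."""
    hits = []
    for a, b in combinations(names, 2):
        score = scorer(a, b)
        if score >= threshold:
            hits.append((a, b, score))
    # Highest-scoring pairs first
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


def stand_in_scorer(a, b):
    # Placeholder so the sketch runs with the standard library only.
    # This is NOT Jaro-Winkler; swap in textdistance.jaro_winkler for real use.
    return SequenceMatcher(None, a, b).ratio()


names = ['Martha', 'Marhta', 'Mark', 'Marta', 'Mathew', 'Matthew']
for a, b, score in pairs_above_threshold(names, stand_in_scorer, threshold=0.9):
    print(f"{a} ~ {b}: {score:.3f}")
```

Because the scorer is passed in as an argument, the same helper works with any function that takes two strings and returns a number. The right threshold depends on your data, so it is worth inspecting a sample of the matched pairs before trusting a fixed cutoff.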

Implementing the Jaro-Winkler similarity

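The textdistance library is not strictly required: the metric is short enough to implement from scratch. The code below first computes the Jaro similarity from the number of matching characters and transpositions, then applies the Winkler adjustment, which raises the score for strings that share a common prefix of up to max_l characters, weighted by the scaling factor p.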

def jaro_distance(s1, s2):
    s1_len = len(s1)
    s2_len = len(s2)

    if s1_len == 0 or s2_len == 0:
        return 0.0

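    # Characters count as matching only within this window; clamp at 0 so
    # that single-character strings are still compared correctly
    match_distance = max(0, max(s1_len, s2_len) // 2 - 1)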

    s1_matches = [False] * s1_len
    s2_matches = [False] * s2_len

    matches = 0
    transpositions = 0

    # Finding matches
    for i in range(s1_len):
        start = max(0, i - match_distance)
        end = min(i + match_distance + 1, s2_len)

        for j in range(start, end):
            if s2_matches[j]:
                continue
            if s1[i] != s2[j]:
                continue
            s1_matches[i] = True
            s2_matches[j] = True
            matches += 1
            break

    if matches == 0:
        return 0.0

    k = 0
    for i in range(s1_len):
        if not s1_matches[i]:
            continue
        while not s2_matches[k]:
            k += 1
        if s1[i] != s2[k]:
            transpositions += 1
        k += 1

    transpositions /= 2

    return (matches / s1_len + matches / s2_len + (matches - transpositions) / matches) / 3.0

def jaro_winkler(s1, s2, p=0.1, max_l=4):
    jaro_dist = jaro_distance(s1, s2)

    # Find the length of common prefix up to a max of max_l
    prefix_length = 0
    for i in range(min(len(s1), len(s2))):
        if s1[i] == s2[i]:
            prefix_length += 1
        else:
            break
        if prefix_length == max_l:
            break

    return jaro_dist + (prefix_length * p * (1 - jaro_dist))

# Example Usage:
s1 = "MARTHA"
s2 = "MARHTA"
print(f"Jaro-Winkler Similarity: {jaro_winkler(s1, s2)}")

Output:

Jaro-Winkler Similarity: 0.9611111111111111
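
To see how much of that score comes from the shared prefix, the Winkler adjustment can be isolated into a small function that takes a precomputed Jaro score. The function name winkler_adjust and its boost_threshold parameter are hypothetical and exist only for this sketch. The threshold models a common variation in which the bonus is skipped for low Jaro scores (0.7 is a frequent cutoff).

```python
def winkler_adjust(jaro, prefix_length, p=0.1, max_l=4, boost_threshold=None):
    """Apply the Winkler prefix bonus to a precomputed Jaro score.

    boost_threshold is a hypothetical parameter for this sketch: when set,
    the bonus is skipped for Jaro scores at or below it.
    """
    if boost_threshold is not None and jaro <= boost_threshold:
        return jaro
    # Only the first max_l characters of the common prefix are rewarded
    capped = min(prefix_length, max_l)
    return jaro + capped * p * (1 - jaro)


jaro = 17 / 18  # Jaro score for "MARTHA" vs "MARHTA"
for prefix_length in range(0, 6):
    print(prefix_length, round(winkler_adjust(jaro, prefix_length), 4))
```

With a prefix length of 3 this reproduces the score of about 0.9611 shown above, and any prefix longer than max_l gives the same result as a prefix of exactly max_l. Keep p * max_l at or below 1, otherwise the adjusted score can exceed 1.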

You can create a pandas DataFrame of names and then apply the Jaro-Winkler similarity function to calculate the similarity between each pair of names. Below is an example of how to do this:

import pandas as pd

# Create a pandas DataFrame with some names
names = ['Martha', 'Marhta', 'Mark', 'Marta', 'Mathew', 'Matthew']
df = pd.DataFrame(names, columns=['Name'])

# Create a function to apply the Jaro-Winkler similarity to each pair of names
def calculate_similarity(row, df):
    return df['Name'].apply(lambda x: jaro_winkler(row['Name'], x))

# Apply the function to the DataFrame
similarity_matrix = df.apply(lambda row: calculate_similarity(row, df), axis=1)

# Set the index and columns names for better readability
similarity_matrix.index = df['Name']
similarity_matrix.columns = df['Name']

print(similarity_matrix)

The output will be a matrix with names as both rows and columns, with each cell representing the Jaro-Winkler similarity between the names:

           Martha    Marhta      Mark     Marta    Mathew   Matthew
Martha   1.000000  0.961111  0.825000  0.966667  0.822222  0.796825
Marhta   0.961111  1.000000  0.825000  0.961111  0.755556  0.730159
Mark     0.825000  0.825000  1.000000  0.848333  0.688889  0.676190
Marta    0.966667  0.961111  0.848333  1.000000  0.760000  0.740952
Mathew   0.822222  0.755556  0.688889  0.760000  1.000000  0.966667
Matthew  0.796825  0.730159  0.676190  0.740952  0.966667  1.000000

This matrix allows you to see the similarity between any pair of names in the DataFrame.
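
When the goal is to link names from one dataset to another rather than inspect a full matrix, it is often enough to keep only the best-scoring candidate for each name. Here is a minimal sketch of that idea. The function name best_match, the min_score parameter, and the two sample lists are made up for illustration, and the stand-in scorer uses difflib so the example runs with the standard library only. It is not Jaro-Winkler; pass the jaro_winkler function defined above, or textdistance.jaro_winkler, in real use.

```python
from difflib import SequenceMatcher


def best_match(name, candidates, scorer, min_score=0.0):
    """Return (candidate, score) for the highest-scoring candidate, or None."""
    best = None
    for candidate in candidates:
        score = scorer(name, candidate)
        # Keep the candidate only if it clears the cutoff and beats the current best
        if score >= min_score and (best is None or score > best[1]):
            best = (candidate, score)
    return best


def stand_in_scorer(a, b):
    # Placeholder, NOT Jaro-Winkler: replace with jaro_winkler for real use.
    return SequenceMatcher(None, a, b).ratio()


incoming = ['Marhta', 'Mathew', 'Zoe']     # hypothetical names to resolve
reference = ['Martha', 'Mark', 'Matthew']  # hypothetical canonical names

for name in incoming:
    print(name, '->', best_match(name, reference, stand_in_scorer, min_score=0.8))
```

Returning None when nothing clears min_score makes unmatched names explicit, so they can be reviewed by hand instead of being silently linked to a poor candidate.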

Conclusion

The Jaro-Winkler similarity metric is a powerful tool for comparing names that may have slight variations in spelling or formatting. By using the textdistance library in Python, you can easily implement this metric to compare a list of names, helping you to accurately identify matches and clean your data.

Whether you're working with names of people, cities, or other entities, this method can significantly improve the accuracy of your data merging and matching processes.
