Levenshtein distance (Edit distance)

정의

한 단어를 다른 하나로 바꾸기 위해 글자를 수정(edit)해야 하는 횟수를 계산하여 유사도를 판단하는 것을 edit distance라고 한다. edit operation을 어디까지 1개로 볼 것인가에 따라 여러가지 알고리즘이 있는데 Levenshtein에서는 글자의 삽입(insertion), 제거(deletion), 대체(substitution) 3가지를 1번의 edit으로 본다.

예를 들면, lev(”hello”, “shallow”) = 3이다.
insertion 2번 (제일 앞에 s 추가, 마지막에 w 추가)과 substitution 1번 (e→a)

직관적으로 값을 구하는 것은 쉽지만, 어떻게 코드로 어떻게 구현해야할지 언뜻 생각해보면 꽤나 생각해볼 게 많은 문제임을 알 수 있다. 아래는 엄밀하게 표현한 수학적 정의.

head는 단어의 첫 글자, tail은 head를 뺀 나머지
a의 길이가 0이면 0, b의 길이가 0이면 0
head가 동일하면, tail끼리 lev dist를 구한다.
head가 같지 않다면, 1을 더하고 어떤 edit을 해야 하는 경우인지 확인한다.
(from “a” to “b”로 생각)
- lev(tail(a), b)가 가장 작음 → a의 head를 delete
- lev(a, tail(b))가 가장 작음 → a의 head에 insert
- lev(tail(a), tail(b))가 가장 작음 → a의 head를 substitute
이를 반복해 나간다.

알고리즘 구현

Naive recursive

위 정의 그대로 recursive하게 구현한다…;; 매 글자마다 edit 후 뒤에 오는 문자열 조합의 lev dist를 모두 계산해봐야 하기 때문에 중복 계산이 매우 많이 발생한다.

Python code

def levenshtein_naive(x:str, y:str, print_en:bool):

    x = list(x)
    y = list(y)
    score = 0

    while True:
        if x == y :
            break
        else:
            x_org = ''.join(x)
            y_org = ''.join(y)
            x_tail = ''.join(x[1:])
            y_tail = ''.join(y[1:])

            if len(x) == 0 and len(y) > 0 : # only X has string
                for _ in range(len(y)):
                    score += 1
                    if print_en : print(f"x_head :   | y_head : {y[0]} --> EDIT (INSERT)")
                    y = y_tail
                break
            elif len(y) == 0 and len(x) > 0 : # only Y has string
                for _ in range(len(x)):
                    score += 1
                    if print_en : print(f"x_head : {x[0]} | y_head :   --> EDIT (DELETE)")
                    x = x_tail
                break
            elif x[0] == y[0]:
                if print_en : print(f"x_head : {x[0]} | y_head : {y[0]} --> SAME")
                x = x[1:]
                y = y[1:]
            else:
                score += 1
                delete = levenshtein_naive(x_tail, y_org, False)
                insert = levenshtein_naive(x_org , y_tail, False)
                substitute = levenshtein_naive(x_tail, y_tail, False)

                if delete < insert and delete < substitute:
                    if print_en : print(f"x_head : {x[0]} | y_head :   --> EDIT (DELETE)")
                    x = x_tail
                    y = y_org
                elif insert < delete and insert < substitute:
                    if print_en : print(f"x_head :   | y_head : {y[0]} --> EDIT (INSERT)")
                    x = x_org
                    y = y_tail
                else: #if substitute < delete and substitute < insert:
                    if print_en : print(f"x_head : {x[0]} | y_head : {y[0]} --> EDIT (SUBSTITUTE)")
                    x = x_tail
                    y = y_tail

    return score

Wagner-Fischer algorithm (Iterative with full matrix)

첫 번째 방법의 redundant calculation을 피하기 위하여 Dynamic programming 을 이용해 최적화한 알고리즘. flood filling 방식으로 모든 prefix들 사이의 edit distance를 구해 matrix를 완성한다. 마지막 계산값이 full string의 distance가 된다.

Matrix filling 방법

0) Distance matrix의 구성은 아래와 같다.

Matrix size : {word1의 length}+1 x {word2의 length}+1
- 각 dimension의 첫 칸을 null을 위해 비운다.
lev(i,j) : matrix (i, j)의 값은 X[0:i]를 Y[0:j]로 바꾸기 위해 필요한 edit 횟수를 의미한다.

1) 0-th row/col 채우기 (optimal substructure의 초기값)

두 문자열 모두 첫 글자는 항상 null이므로, lev(0,0)는 항상 0이다.
0-th row : null이 Y[0:j]가 되려면 j만큼 insert해야 한다. 따라서, lev(0, j)는 항상 j 이다.
0-th col : X[0:i]가 null이 되려면 i만큼 delete해야 한다. 따라서, lev(i:0)는 항상 i 이다.

⇒ 어떤 문자열이던 항상 lev(i,0) = i , lev(0:j) = j으로 초기화된다. 이렇게 항상 정해진 값을 가지는 첫 행과 열은 점화식 형태를 가지는 dynamic programming에서 초기값의 역할을 하게 된다.

2) 2x2 행렬을 반복해서 채워가기

위 규칙에 따라 distance matrix을 왼쪽 위부터 오른쪽 아래 방향으로 채워나간다.
- 초기값을 채워 놓았기 때문에 구하려는 값을 (1,1)로 가지는 2x2 matrix의 나머지 값을 모두 알 수 있다.
- lev(i-1, j) + 1 : 2x2에서 우상단 값 + 1. delete하는 경우.
- lev(i, j-1) + 1 : 2x2에서 좌하단 값 + 1. insert하는 경우.
- lev(i-1, j-1) +1
  - X[i] = Y[j] 이면, 2x2에서 좌상단 값을 그대로 가져온다. edit이 필요없는 경우.
  - X[i] ≠ Y[j] 이면, 2x2에서 좌상단 값 + 1. substitute하는 경우.
(예시) lev(1, 1)
X[1] ≠ Y[1] → lev(1,1) = min(lev(0,1) + 1, lev(1,0) + 1, lev(0,0) + 1 = min(2, 2, 1) = 1

(예시) lev(1, 2)
X[1] = Y[2] → lev(1,2) = min(lev(0,2) + 1, lev(1,0) + 1, lev(0,1) = min(3, 2, 1) = 1

완성된 distance matrix. 가장 마지막 값이 최종 Levenshtein distance가 된다.

Python code

def levenshtein_dp(x:str, y:str, print_en:bool):

    x = list(x)
    y = list(y)

    # 0) Create distance matrix
    x.insert(0, " ")
    y.insert(0, " ")
    row = len(x)
    col = len(y)
    lev = [[0 for _ in range(col)] for _ in range(row)]

    # 1) Fill 0-th row/col (optimal substructure initialization)
    for i in range(row):
        lev[i][0] = i

    for j in range(col):
        lev[0][j] = j

    # 2) Full matrix filling
    for i in range(1, row):
        for j in range(1, col):
            delete = lev[i-1][j] + 1
            insert = lev[i][j-1] + 1
            sub_or_same = lev[i-1][j-1] + (x[i]!=y[j])

            if delete < insert and delete < sub_or_same:
                if print_en : print(f"x_head : {x[i]} | y_head :   --> EDIT (DELETE)")
                lev[i][j] = delete
            elif insert < delete and insert < sub_or_same:
                if print_en : print(f"x_head :   | y_head : {y[i]} --> EDIT (INSERT)")
                lev[i][j] = insert
            else:
                if print_en and x[i]==y[j] : print(f"x_head : {x[i]} | y_head : {y[j]} --> SAME")
                if print_en and x[i]!=y[j] : print(f"x_head : {x[i]} | y_head : {y[j]} --> EDIT (SUBSTITUTE)")
                lev[i][j] = sub_or_same

    # 3) Final distance
    for r in lev : print(r)
    score = lev[row-1][col-1]
    return score

Naive vs. Dynamic programming 결과 비교

# Dynamic programming
## Distance matrix
[0, 1, 2, 3, 4, 5, 6, 7]
[1, 1, 1, 2, 3, 4, 5, 6]
[2, 2, 2, 2, 3, 4, 5, 6]
[3, 3, 3, 3, 2, 3, 4, 5]
[4, 4, 4, 4, 3, 2, 3, 4]
[5, 5, 5, 5, 4, 3, 2, 3]

edit distance (dp) = 3 | execution time = 0.0008s
Current memory usage: 0.0026 MB; Peak: 0.0130 MB

# Naive recursive
## edits
x_head :   | y_head : s --> EDIT (INSERT)
x_head : h | y_head : h --> SAME
x_head : e | y_head : a --> EDIT (SUBSTITUTE)
x_head : l | y_head : l --> SAME
x_head : l | y_head : l --> SAME
x_head : o | y_head : o --> SAME
x_head :   | y_head : w --> EDIT (INSERT)

edit distance (naive) = 3 | execution time = 0.0244s
Current memory usage: 0.0095 MB; Peak: 0.0199 MB

execution time이 naive recursive 대비 dp가 매우 작은 것을 확인할 수 있다.
Memory usage 역시 dp가 살짝 작다.

Reference

https://en.wikipedia.org/wiki/Levenshtein_distance#Recursive

https://en.wikipedia.org/wiki/Edit_distance

https://en.wikipedia.org/wiki/Wagner–Fischer_algorithm

https://blog.paperspace.com/measuring-text-similarity-using-levenshtein-distance/

⚠️ Disclaimer 혼자 공부한 내용을 정리한 글이므로 잘못된 내용이 있을 수 있습니다.
틀린 내용이나 더 좋은 방법이 있는 경우, 댓글로 알려주시면 업데이트하겠습니다.

저작자표시 비영리 변경금지 (새창열림)

'Data Science' 카테고리의 다른 글

[DSA 101] 4. Sorting (0)	2025.04.15
[DSA 101] 3. Iteration vs. Recursion (0)	2025.04.13
[DSA 101] 2. Searching algorithm (0)	2025.04.13
[DSA 101] 1. Computation Complexity와 Big-O (1)	2025.04.13

Levenshtein distance (Edit distance)

정의

알고리즘 구현

Naive recursive

Python code

Wagner-Fischer algorithm (Iterative with full matrix)

Matrix filling 방법

Python code

Naive vs. Dynamic programming 결과 비교

Reference

'Data Science' 카테고리의 다른 글

관련글

티스토리툴바