Recursion

Levenshtein Distance

Write a function called levenshtein_distance that takes as input two strings and returns the Levenshtein distance between the two strings. Intuitively, the Levenshtein distance is the minimum number of edit operations to transform one string into the other (for this reason Levenshtein distance is sometimes called “edit distance”). These edits can either be insertions, deletions, or substitutions. Note that Levenshtein distance is similar to Hamming distance, but works for strings of differing lengths

Here are some examples of these operations:

  1. kittensitten (substitution of s for k)
  2. sittensittin (substitution of i for e)
  3. sittinsitting (insertion of g at the end).

While this function seems initially daunting, it admits a very compact recursive solution. You can either work on your own to see the recursive solution, or use the recursive solution given in the Wikipedia article.

We will be memoizing this function in the next section to make it more computationally efficient.

More Recursion

Let’s circle back on some of the recursion practice problems from last time. We’ll start by implementing Levenshtein distance.

To get a better handle on this, let’s consider some more examples.

levenshtein_distance('kitten', 'smitten') -> 2 (see below for steps)

  1. kittensitten (k gets replaced by s)
  2. sittensmitten (insert between s and i)

levenshtein_distance('beta', 'pedal') -> 3 (see below for steps)

  1. betapeta (b gets replaced by p)
  2. petapetal (l gets inserted at the end)
  3. petalpedal (t gets replaced by d)

levenshtein_distance('battle', 'bet') -> 4 (see below for steps)

  1. battlebettle (a gets replaced by e)
  2. bettlebettl (the last e gets deleted)
  3. bettlbett (delete l)
  4. bettbet (delete t)

Base Cases

Let’s consider the base cases when one of the two strings is empty. What should the Levenshtein distance be in this case?

Recursive Step

Let’s consider the different ways in which we can make the first character of string a equal to the first character of string b. Here are the possible cases.

For each of these steps we have to consider two things:

  1. How much does this first step cost?
  2. How much does it cost to make the rest of the two strings equal to each other

Memoization

In SoftDes, a lot of you had the chance to do this exercise:

Write a function called choose that takes two integer, n and k, and returns the number of ways to choose k items from a set of n (this is also known as the number of combinations of k items from a pool of n). Your solution should be implemented recursively using Pascal’s rule.

Here is a sample solution:

def nchoosek(n, k):
    """Return the number of combinations of size k that can be made from n items.

    Examples:
      >>> nchoosek(5, 3)
      10
      >>> nchoosek(1, 1)
      1
      >>> nchoosek(4, 2)
      6
    """
    if k == 0:
        return 1
    if n == k:
        return 1
    return nchoosek(n - 1, k - 1) + nchoosek(n - 1, k)

if __name__ == '__main__':
    import doctest
    doctest.testmod(verbose=True)

It passes all the unit tests!!! Hooray!

Unfortunately, this code is going to be quite slow. To get a sense of it, we will draw a tree that shows the recursive pattern of the function.

In order to improve the speed of this code, we can make use of a pattern called memoization. The basic idea is to transform a recursive implementation of a function to make use of a cache (in this case a Python dictionary) that remembers all previously computed values for the function. Here’s is a skeleton of a memoized recursive function (we are being a little fast and loose with the mixing pseudo code and Python, but this should become clear when we do a concrete example.

def recursive_function(input1, input2):
    if input1, input2 is a base case:
        return base case result
    if input1, input2 is in the list of already computed answers
        return precomputed answer
    return recursive step on a smaller version of input1, input2

Next, try to modify our function nchoosek to use memoization.

The call graphs of the memoized nchoosek and Levenshtein functions are here.