Implementing “Did you mean …?” Suggestions

Introduction

A lot of software that parses commands (such as git) or compilers, when you specify something that’s unknown, perhaps by having made a typo, offer “did you mean …?” suggestions to be helpful. For example, given the following C program:

#include <stdio.h>

int main() {
  print( "hello, world!n" );
}

clang will print:

oops.c:4:3: error: call to undeclared function 'print'
    4 |   print( "hello, world!n" );
      |   ^
oops.c:4:3: note: did you mean 'printf'?

Specifically, it prints other in-scope identifiers that are similar to the unknown one — what I call did-you-mean suggestions (DYMS).

I wanted to add DYMS to cdecl. It turns out that the canonical way to do this is by using something called the Damerau–Levenshtein distance (DLD) that is:

… the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.

Hence, to go from print to printf, the distance would be 1 (insert 'f').

In this article, I’m not going to discuss the details of the algorithm. For that, you can refer to the Wikipedia page or this better explanation. Instead, I’m going to discuss how to incorporate DLD into programs efficiently and effectively.

Damerau–Levenshtein API

Typically, DLD is implemented in a single function having the signature:

size_t dam_lev_dist( char const *unknown, char const *known );

That is, it returns the DLD to get from unknown to known. The thing is that the function is never called just once; it’s called repeatedly in a loop, once for each known keyword you’re trying to determine whether it’s what the user meant by unknown.

If you look at implementations, they all allocate arrays of char internally for partial-result storage, then deallocate them. If there are n keywords, that means at least n calls to malloc and free.

Better, would be to calculate the maximum amount of memory to allocate for all n keywords once, allocate it before calling dam_lev_dist, then calling free once after we’re done. For that, we can use the matrix2d_new function:

void* dam_lev_new( size_t unknown_len, size_t max_known_len ) {
  return matrix2d_new(
    sizeof(size_t), alignof(size_t),
    unknown_len + 2, max_known_len + 2
  );
}

(The reason for the + 2s is to include extra rows and columns for both 0 and n so accessing m[i][j] doesn’t need to be bounds-checked.) The API then changes to:

size_t dam_lev_dist( void *working_mem,
                     char const *unknown, size_t ulen,
                     char const *known, size_t klen );

Limited Damerau–Levenshtein Distance

While a DLD between two strings is generally useful, it’s not useful specifically for offering suggestions. For example, given an unknown string of “fixed” and a known string of “float” (having a DLD of 4 between them) it’s probably safe to assume that because “fixed” is so different from “float” that there’s no way “float” was meant, so suggesting “float” for “fixed” would seem weird. It would be better to offer no suggestions than not-even-close suggestions.

Hence, DLDs are useful for suggestions only when they’re limited (not too big). Intuitively, a useful DLD should be relative to the length of the known string: a DLD of 4 when the length is 5 is too big (80%), but perhaps not when the length is, say, 10 (40%). Hence, a suggestion is reasonable only when DLD ≤ some percentage of known_len:

bool is_similar_enough( did_you_mean const *dym ) {
  return dym->dam_lev_dist <= (size_t)(dym->known_len * SIMILAR_ENOUGH_PERCENT + 0.5);
}

Empirically, I determined the following to be a good value for SIMILAR_ENOUGH_PERCENT for cdecl:

static double const SIMILAR_ENOUGH_PERCENT = .37;

Did-You-Mean API

A did-you-mean API would start with:

struct did_you_mean {
  char const *known;         // Known candidate.
  size_t      known_len;     // Length of known.
  size_t      dam_lev_dist;  // Calculated DLD.
  void       *user_data;     // Optional user data.
};
typedef struct did_you_mean did_you_mean_t;

User code would dynamically allocate an array of did_you_mean and set each known to a known keyword. For example, if the user typed an unknown command of explai (missing the n), then cdecl would allocate an array of did_you_mean where each known is a known cdecl command, one of cast, declare, define, expand, explain, help, include, set, show, typedef, or exit. The convention we’ll use is that the array will be one longer than the list of known keywords so the last element can have a known of NULL to mark the end.

Given that, we can define:

typedef void (*dym_cleanup_fn_t)( did_you_mean_t const* );
typedef bool (*dym_similar_fn_t)( did_you_mean_t const* );

bool dym_calc( char const *unknown,
               did_you_mean_t *dym_array,
               dym_similar_fn_t similar_fn,
               dym_cleanup_fn_t cleanup_fn );

The function will calculate the DLD between unknown and each known in the array putting the result in dam_lev_dist. For similar_fn, we can pass &is_similar_enough; cleanup_fn will be discussed shortly.

The implementation starts off as:

did_you_mean_t *dym;
size_t max_known_len = 0;

// calculate the maximum known lengths
for ( dym = dym_array; dym->known != NULL; ++dym ) {
  dym->known_len = strlen( dym->known );
  if ( dym->known_len > max_known_len )
    max_known_len = dym->known_len;
}

// calculate Damerau-Levenshtein distance for all candidates
size_t const unknown_len = strlen( unknown );
void *const dl_mem = dam_lev_new( unknown_len, max_known_len );
for ( dym = dym_array; dym->known != NULL; ++dym ) {
  dym->dam_lev_dist = dam_lev_dist(
    dl_mem, unknown, unknown_len, dym->known, dym->known_len
  );
}
free( dl_mem );

// sort by Damerau-Levenshtein distance
size_t const dym_size = (size_t)(dym - dym_array);
qsort(
  dym_array, dym_size, sizeof dym_array[0],
  (qsort_cmp_fn_t)&dym_cmp
);

That’s all fairly straightforward. The dym_cmp function compares two did_you_mean elements for qsort:

static int dym_cmp( did_you_mean_t const *i_dym,
                    did_you_mean_t const *j_dym ) {
  ssize_t const cmp =
   (ssize_t)i_dym->dam_lev_dist - (ssize_t)j_dym->dam_lev_dist );
  return  cmp != 0 ? (int)cmp :
          strcmp( i_dym->known, j_dym->known );
}

The remainder of the dym_calc implementation is:

if ( dym_array->dam_lev_dist > 0 ) {
  bool found_at_least_1 = false;

  for ( dym = dym_array; dym->known != NULL; ++dym ) {
    if ( !(*similar_fn)( dym ) )
      break;
    found_at_least_1 = true;
  }

  if ( found_at_least_1 ) {
    //
    // Free known literals past the best ones and set
    // the one past the last to NULL to mark the end.
    //
    dym_cleanup_all( dym, cleanup_fn );
    *dym = (did_you_mean_t){ 0 };
    return true;
  }
}

If dym_array[0].dam_lev_dist is zero, it means unknown was actually a known keyword. (In this case, offering suggestions shouldn’t even be done.) Otherwise, iterate through the dym_array looking for the first known that is not similar enough according to similar_fn, stop there, call the clean_up function to clean-up array elements from there until the end, then mark the first not-similar-enough as the new end by assigning 0.

The dym_cleanup_all function cleans-up all did_you_mean elements starting from dym until the end:

static void dym_cleanup_all( did_you_mean_t const *dym,
                             dym_cleanup_fn_t cleanup_fn ) {
  assert( dym != NULL );
  if ( cleanup_fn == NULL )
    return;
  while ( dym->known != NULL )
    (*cleanup_fn)( dym++ );
}

The cleanup_fn is needed to clean-up did_you_mean array elements, for example, if known were dynamically allocated, then cleanup_fn would free it.

If found_at_least_1 is false, then:

dym_free( dym_array, cleanup_fn );
return false;

That is, we free the entire dym_array. Lastly, we need:

void dym_free( did_you_mean_t const *dym_array,
               dym_cleanup_fn_t cleanup_fn ) {
  if ( dym_array != NULL ) {
    dym_cleanup_all( dym_array, cleanup_fn );
    free( (void*)dym_array );
  }
}

that user code must call after giving all suggestions.

Conclusion

The Damerau–Levenshtein distance is a relatively simple algorithm for offering did-you-mean suggestions, but it can’t be used by itself in order to offer good suggestions. Including such suggestions in programs make them better for users.

Epilogue

Some implementations include an alternate scoring function in addition to dam_lev_dist that checks to see if known is a prefix of unknown, e.g., uint is a prefix of uint8_t: if so, it replaces what dam_lev_dist would have calculated and returns 1. This technique can often include more good suggestions.

I didn’t incorporate this into cdecl because it’s sometimes too good, e.g., uint would return uint8_t, uint16_t, uint32_t, uint64_t, uintmax_t, uintptr_t, uint_fast8_t, …, uint_least8_t, …, that’s a bit much. It’s dependent on the particular application. If you want to add it, it’s left as an exercise for the reader.

The relevant source file from cdecl are:

The dam_lev and did_you_mean files are generic; only the cdecl_dym files are specific to cdecl.

Similar Posts