Implementing “Did you mean …?” Suggestions
Introduction
A lot of software that parses commands (such as git
) or compilers, when you specify something that’s unknown, perhaps by having made a typo, offer “did you mean …?” suggestions to be helpful. For example, given the following C program:
#include <stdio.h>
int main() {
print( "hello, world!n" );
}
clang
will print:
oops.c:4:3: error: call to undeclared function 'print'
4 | print( "hello, world!n" );
| ^
oops.c:4:3: note: did you mean 'printf'?
Specifically, it prints other in-scope identifiers that are similar to the unknown one — what I call did-you-mean suggestions (DYMS).
I wanted to add DYMS to cdecl
. It turns out that the canonical way to do this is by using something called the Damerau–Levenshtein distance (DLD) that is:
… the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.
Hence, to go from print
to printf
, the distance would be 1 (insert 'f'
).
In this article, I’m not going to discuss the details of the algorithm. For that, you can refer to the Wikipedia page or this better explanation. Instead, I’m going to discuss how to incorporate DLD into programs efficiently and effectively.
Damerau–Levenshtein API
Typically, DLD is implemented in a single function having the signature:
size_t dam_lev_dist( char const *unknown, char const *known );
That is, it returns the DLD to get from unknown to known. The thing is that the function is never called just once; it’s called repeatedly in a loop, once for each known keyword you’re trying to determine whether it’s what the user meant by unknown.
If you look at implementations, they all allocate arrays of char
internally for partial-result storage, then deallocate them. If there are n keywords, that means at least n calls to malloc
and free
.
Better, would be to calculate the maximum amount of memory to allocate for all n keywords once, allocate it before calling dam_lev_dist
, then calling free
once after we’re done. For that, we can use the matrix2d_new
function:
void* dam_lev_new( size_t unknown_len, size_t max_known_len ) {
return matrix2d_new(
sizeof(size_t), alignof(size_t),
unknown_len + 2, max_known_len + 2
);
}
(The reason for the + 2
s is to include extra rows and columns for both 0 and n so accessing m[i][j] doesn’t need to be bounds-checked.) The API then changes to:
size_t dam_lev_dist( void *working_mem,
char const *unknown, size_t ulen,
char const *known, size_t klen );
Limited Damerau–Levenshtein Distance
While a DLD between two strings is generally useful, it’s not useful specifically for offering suggestions. For example, given an unknown string of “fixed” and a known string of “float” (having a DLD of 4 between them) it’s probably safe to assume that because “fixed” is so different from “float” that there’s no way “float” was meant, so suggesting “float” for “fixed” would seem weird. It would be better to offer no suggestions than not-even-close suggestions.
Hence, DLDs are useful for suggestions only when they’re limited (not too big). Intuitively, a useful DLD should be relative to the length of the known string: a DLD of 4 when the length is 5 is too big (80%), but perhaps not when the length is, say, 10 (40%). Hence, a suggestion is reasonable only when DLD ≤ some percentage of known_len
:
bool is_similar_enough( did_you_mean const *dym ) {
return dym->dam_lev_dist <= (size_t)(dym->known_len * SIMILAR_ENOUGH_PERCENT + 0.5);
}
Empirically, I determined the following to be a good value for SIMILAR_ENOUGH_PERCENT
for cdecl
:
static double const SIMILAR_ENOUGH_PERCENT = .37;
Did-You-Mean API
A did-you-mean API would start with:
struct did_you_mean {
char const *known; // Known candidate.
size_t known_len; // Length of known.
size_t dam_lev_dist; // Calculated DLD.
void *user_data; // Optional user data.
};
typedef struct did_you_mean did_you_mean_t;
User code would dynamically allocate an array of did_you_mean
and set each known
to a known keyword. For example, if the user typed an unknown command of explai
(missing the n
), then cdecl
would allocate an array of did_you_mean
where each known
is a known cdecl
command, one of cast
, declare
, define
, expand
, explain
, help
, include
, set
, show
, typedef
, or exit
. The convention we’ll use is that the array will be one longer than the list of known keywords so the last element can have a known
of NULL
to mark the end.
Given that, we can define:
typedef void (*dym_cleanup_fn_t)( did_you_mean_t const* );
typedef bool (*dym_similar_fn_t)( did_you_mean_t const* );
bool dym_calc( char const *unknown,
did_you_mean_t *dym_array,
dym_similar_fn_t similar_fn,
dym_cleanup_fn_t cleanup_fn );
The function will calculate the DLD between unknown
and each known
in the array putting the result in dam_lev_dist
. For similar_fn
, we can pass &is_similar_enough
; cleanup_fn
will be discussed shortly.
The implementation starts off as:
did_you_mean_t *dym;
size_t max_known_len = 0;
// calculate the maximum known lengths
for ( dym = dym_array; dym->known != NULL; ++dym ) {
dym->known_len = strlen( dym->known );
if ( dym->known_len > max_known_len )
max_known_len = dym->known_len;
}
// calculate Damerau-Levenshtein distance for all candidates
size_t const unknown_len = strlen( unknown );
void *const dl_mem = dam_lev_new( unknown_len, max_known_len );
for ( dym = dym_array; dym->known != NULL; ++dym ) {
dym->dam_lev_dist = dam_lev_dist(
dl_mem, unknown, unknown_len, dym->known, dym->known_len
);
}
free( dl_mem );
// sort by Damerau-Levenshtein distance
size_t const dym_size = (size_t)(dym - dym_array);
qsort(
dym_array, dym_size, sizeof dym_array[0],
(qsort_cmp_fn_t)&dym_cmp
);
That’s all fairly straightforward. The dym_cmp
function compares two did_you_mean
elements for qsort
:
static int dym_cmp( did_you_mean_t const *i_dym,
did_you_mean_t const *j_dym ) {
ssize_t const cmp =
(ssize_t)i_dym->dam_lev_dist - (ssize_t)j_dym->dam_lev_dist );
return cmp != 0 ? (int)cmp :
strcmp( i_dym->known, j_dym->known );
}
The remainder of the dym_calc
implementation is:
if ( dym_array->dam_lev_dist > 0 ) {
bool found_at_least_1 = false;
for ( dym = dym_array; dym->known != NULL; ++dym ) {
if ( !(*similar_fn)( dym ) )
break;
found_at_least_1 = true;
}
if ( found_at_least_1 ) {
//
// Free known literals past the best ones and set
// the one past the last to NULL to mark the end.
//
dym_cleanup_all( dym, cleanup_fn );
*dym = (did_you_mean_t){ 0 };
return true;
}
}
If dym_array[0].dam_lev_dist
is zero, it means unknown
was actually a known keyword. (In this case, offering suggestions shouldn’t even be done.) Otherwise, iterate through the dym_array
looking for the first known
that is not similar enough according to similar_fn
, stop there, call the clean_up
function to clean-up array elements from there until the end, then mark the first not-similar-enough as the new end by assigning 0
.
The dym_cleanup_all
function cleans-up all did_you_mean
elements starting from dym
until the end:
static void dym_cleanup_all( did_you_mean_t const *dym,
dym_cleanup_fn_t cleanup_fn ) {
assert( dym != NULL );
if ( cleanup_fn == NULL )
return;
while ( dym->known != NULL )
(*cleanup_fn)( dym++ );
}
The cleanup_fn
is needed to clean-up did_you_mean
array elements, for example, if known
were dynamically allocated, then cleanup_fn
would free
it.
If found_at_least_1
is false
, then:
dym_free( dym_array, cleanup_fn );
return false;
That is, we free the entire dym_array
. Lastly, we need:
void dym_free( did_you_mean_t const *dym_array,
dym_cleanup_fn_t cleanup_fn ) {
if ( dym_array != NULL ) {
dym_cleanup_all( dym_array, cleanup_fn );
free( (void*)dym_array );
}
}
that user code must call after giving all suggestions.
Conclusion
The Damerau–Levenshtein distance is a relatively simple algorithm for offering did-you-mean suggestions, but it can’t be used by itself in order to offer good suggestions. Including such suggestions in programs make them better for users.
Epilogue
Some implementations include an alternate scoring function in addition to dam_lev_dist
that checks to see if known is a prefix of unknown, e.g., uint
is a prefix of uint8_t
: if so, it replaces what dam_lev_dist
would have calculated and returns 1
. This technique can often include more good suggestions.
I didn’t incorporate this into cdecl
because it’s sometimes too good, e.g., uint
would return uint8_t
, uint16_t
, uint32_t
, uint64_t
, uintmax_t
, uintptr_t
, uint_fast8_t
, …, uint_least8_t
, …, that’s a bit much. It’s dependent on the particular application. If you want to add it, it’s left as an exercise for the reader.
The relevant source file from cdecl
are:
The dam_lev
and did_you_mean
files are generic; only the cdecl_dym
files are specific to cdecl
.