# Tutorial :Match sub-string within a string with tolerance of 1 character mismatch

### Question:

I was going through some Amazon interview questions on CareerCup.com, and I came across this interesting question which I haven't been able to figure out how to do. I have been thinking on this since 2 days. Either I am taking a way off approach, or its a genuinely hard function to write.

Question is as follows:

Write a function in C that can find if a string is a sub-string of another. Note that a mismatch of one character should be ignored.

``A mismatch can be an extra character: â€™dogâ€™ matches â€˜xxxdoogyyyyâ€™    A mismatch can be a missing character: â€™dogâ€™ matches â€˜xxxdgyyyyâ€™   A mismatch can be a different character: â€™dogâ€™ matches â€˜xxxdigyyyyâ€™  ``

The return value wasn't mentioned in the question, so I assume the signature of the function can be something like this:

``char * MatchWithTolerance(const char * str, const char * substr);  ``

If there is a match with the given rules, return the pointer to the beginning of matched substring within the string. Else return null.

Bonus

If someone can also figure out a generic way of making the tolerance to n instead of 1, then that would be just brilliant. In that case the signature would be:

``char * MatchWithTolerance(const char * str, const char * substr, unsigned int tolerance = 1);  ``

Thanks to all those who would attempt this and share their successful solution.

### Solution:1

This seems to work, let me know if you find any errors and I'll try to fix them:

``int findHelper(const char *str, const char *substr, int mustMatch = 0)  {      if ( *substr == '\0' )          return 1;        if ( *str == '\0' )          return 0;        if ( *str == *substr )          return findHelper(str + 1, substr + 1, mustMatch);      else      {          if ( mustMatch )              return 0;            if ( *(str + 1) == *substr )              return findHelper(str + 1, substr, 1);          else if ( *str == *(substr + 1) )              return findHelper(str, substr + 1, 1);          else if ( *(str + 1) == *(substr + 1) )              return findHelper(str + 1, substr + 1, 1);          else if ( *(substr + 1) == '\0' )              return 1;          else              return 0;      }  }    int find(const char *str, const char *substr)  {      int ok = 0;      while ( *str != '\0' )          ok |= findHelper(str++, substr, 0);        return ok;  }      int main()  {      printf("%d\n", find("xxxdoogyyyy", "dog"));      printf("%d\n", find("xxxdgyyyy", "dog"));      printf("%d\n", find("xxxdigyyyy", "dog"));  }  ``

Basically, I make sure only one character can differ, and run the function that does this for every suffix of the haystack.

### Solution:2

This is related to a classical problem of IT, referred to as Levenshtein distance. See Wikibooks for a bunch of implementations in different languages.

### Solution:3

This is slightly different than the earlier solution, but I was intrigued by the problem and wanted to give it a shot. Obviously optimize if desired, I just wanted a solution.

``char *match(char *str, char *substr, int tolerance)  {    if (! *substr) return str;    if (! *str) return NULL;      while (*str)    {      char *str_p;      char *substr_p;      char *matches_missing;      char *matches_mismatched;        str_p = str;      substr_p = substr;        while (*str_p && *substr_p && *str_p == *substr_p)      {        str_p++;        substr_p++;      }      if (! *substr_p) return str;      if (! tolerance)      {        str++;        continue;      }        if (strlen(substr_p) <= tolerance) return str;        /* missed due to a missing letter */      matches_missing = match(str_p, substr_p + 1, tolerance - 1);      if (matches_missing == str_p) return str;        /* missed due to a mismatch of letters */      matches_mismatched = match(str_p + 1, substr_p + 1, tolerance - 1);      if (matches_mismatched == str_p + 1) return str;        str++;    }    return NULL;  }  ``

### Solution:4

Is the problem to do this efficiently?

The naive solution is to loop over every substring of size `substr` in `str`, from left to right, and return true if the current substring if only one of the characters is different in a comparison.

Let n = size of str Let m = size of substr

There are `O(n)` substrings in str, and the matching step takes time `O(m)`. Ergo, the naive solution runs in time `O(n*m)`

## With arbitary no. of tolerance levels.

Worked for all the test cases I could think of. Loosely based on |/|ad's solution.

``#include<stdio.h>  #include<string.h>    report (int x, char* str, char* sstr, int[] t) {      if ( x )          printf( "%s is a substring of %s for a tolerance[%d]\n",sstr,str[i],t[i] );      else          printf ( "%s is NOT a substring of %s for a tolerance[%d]\n",sstr,str[i],t[i] );  }    int find_with_tolerance (char *str, char *sstr, int tol) {        if ( (*sstr) == '\0' ) //end of substring, and match          return 1;        if ( (*str) == '\0' ) //end of string          if ( tol >= strlen(sstr) ) //but tol saves the day              return 1;          else    //there's nothing even the poor tol can do              return 0;        if ( *sstr == *str ) { //current char match, smooth          return find_with_tolerance ( str+1, sstr+1, tol );      } else {          if ( tol <= 0 ) //that's it. no more patience              return 0;          for(int i=1; i<=tol; i++) {              if ( *(str+i) == *sstr ) //insertioan of a foreign character                  return find_with_tolerance ( str+i+1, sstr+1, tol-i );              if ( *str == *(sstr+i) ) //deal with dletion                  return find_with_tolerance ( str+1, sstr+i+1, tol-i );              if ( *(str+i) == *(sstr+i)  ) //deal with riplacement                  return find_with_tolerance ( str+i+1, sstr+i+1, tol-i );              if ( *(sstr+i) == '\0' ) //substr ends, thanks to tol & this loop                  return 1;          }          return 0; //when all fails      }  }    int find (char *str, char *sstr, int tol ) {      int w = 0;      while (*str!='\0')          w |= find_with_tolerance ( str++, sstr, tol );      return (w) ? 1 : 0;  }    int main() {      const int n=3; //no of test cases      char *sstr = "dog"; //the substr      char *str[n] = { "doox", //those cases                      "xxxxxd",                      "xxdogxx" };      int t[] = {1,1,0}; //tolerance levels for those cases      for(int i = 0; i < n; i++) {          report( find ( *(str+i), sstr, t[i] ), *(str+i), sstr, t[i] );      }      return 0;  }  ``

Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Previous
Next Post »