Tutorial: How do I remove diacritics (accents) from a string in .NET?



Question:

I'm trying to convert some strings that are in French Canadian and basically, I'd like to be able to take out the French accent marks in the letters while keeping the letter. (E.g. convert é to e, so crème brûlée would become creme brulee)

What is the best method for achieving this?


Solution:1

I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)

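    using System.Globalization;
    using System.Text;

    static string RemoveDiacritics(string text)
    {
        // Decompose each accented character into a base character plus combining marks.
        var normalizedString = text.Normalize(NormalizationForm.FormD);
        var stringBuilder = new StringBuilder();

        foreach (var c in normalizedString)
        {
            // Keep everything except the non-spacing (combining) marks.
            var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
            if (unicodeCategory != UnicodeCategory.NonSpacingMark)
            {
                stringBuilder.Append(c);
            }
        }

        // Recompose whatever is left.
        return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
    }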

Note that this is a followup to his earlier post: Stripping diacritics....

The approach uses String.Normalize with NormalizationForm.FormD to decompose the input string (basically separating the "base" characters from the diacritics), then scans the result and keeps every character that is not a non-spacing mark. It's a little complicated, but really you're looking at a complicated problem.
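To see what the decomposition step actually produces, here is a minimal sketch. The class and method names are made up for illustration; it simply lists each character of the FormD string together with its Unicode category.

```csharp
using System.Globalization;
using System.Text;

public static class DecompositionDemo
{
    // Returns one "U+XXXX:Category" entry per UTF-16 code unit of the FormD string.
    public static string Describe(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+")
              .Append(((int)c).ToString("X4"))
              .Append(':')
              .Append(CharUnicodeInfo.GetUnicodeCategory(c));
        }
        return sb.ToString();
    }
}
```

Calling DecompositionDemo.Describe("\u00E9") (that is, é) should give "U+0065:LowercaseLetter U+0301:NonSpacingMark": the base letter survives the filter and the combining acute accent is what gets dropped.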

Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.
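For comparison, a table-based approach limited to French could look roughly like the sketch below. The table contents and the names FrenchAccentTable and Strip are assumptions for illustration, not taken from the linked answer; extend the table to whatever your input actually contains.

```csharp
using System.Collections.Generic;
using System.Text;

public static class FrenchAccentTable
{
    // Hypothetical table of accented letters and ligatures used in French.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { 'à', "a" }, { 'â', "a" }, { 'æ', "ae" }, { 'ç', "c" },
        { 'é', "e" }, { 'è', "e" }, { 'ê', "e" }, { 'ë', "e" },
        { 'î', "i" }, { 'ï', "i" }, { 'ô', "o" }, { 'œ', "oe" },
        { 'ù', "u" }, { 'û', "u" }, { 'ü', "u" }, { 'ÿ', "y" },
        { 'À', "A" }, { 'Â', "A" }, { 'Æ', "AE" }, { 'Ç', "C" },
        { 'É', "E" }, { 'È', "E" }, { 'Ê', "E" }, { 'Ë', "E" },
        { 'Î', "I" }, { 'Ï', "I" }, { 'Ô', "O" }, { 'Œ', "OE" },
        { 'Ù', "U" }, { 'Û', "U" }, { 'Ü', "U" }, { 'Ÿ', "Y" },
    };

    // Replaces every character found in the table and copies everything else unchanged.
    public static string Strip(string text)
    {
        if (text == null) return null;
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            string replacement;
            if (Map.TryGetValue(c, out replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Note that this only handles precomposed input; a string that is already in decomposed form (e followed by a combining accent) passes through with its combining marks intact.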


Solution:2

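This did the trick for me:

    string accentedStr = "crème brûlée";
    // ISO-8859-8 has no accented Latin letters; this relies on the encoder's
    // best-fit fallback substituting the closest plain letter.
    byte[] tempBytes = System.Text.Encoding.GetEncoding("ISO-8859-8").GetBytes(accentedStr);
    string asciiStr = System.Text.Encoding.UTF8.GetString(tempBytes);

Quick and short!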


Solution:3

In case someone is interested, I was looking for something similar and ended up writing the following:

    public static string NormalizeStringForUrl(string name)
    {
        String normalizedString = name.Normalize(NormalizationForm.FormD);
        StringBuilder stringBuilder = new StringBuilder();

        foreach (char c in normalizedString)
        {
            switch (CharUnicodeInfo.GetUnicodeCategory(c))
            {
                case UnicodeCategory.LowercaseLetter:
                case UnicodeCategory.UppercaseLetter:
                case UnicodeCategory.DecimalDigitNumber:
                    stringBuilder.Append(c);
                    break;
                case UnicodeCategory.SpaceSeparator:
                case UnicodeCategory.ConnectorPunctuation:
                case UnicodeCategory.DashPunctuation:
                    stringBuilder.Append('_');
                    break;
            }
        }
        string result = stringBuilder.ToString();
        return String.Join("_", result.Split(new char[] { '_' },
            StringSplitOptions.RemoveEmptyEntries)); // remove duplicate underscores
    }


Solution:4

In case anyone's interested, here is the Java equivalent:

    import java.text.Normalizer;

    public class MyClass
    {
        public static String removeDiacritics(String input)
        {
            String nrml = Normalizer.normalize(input, Normalizer.Form.NFD);
            StringBuilder stripped = new StringBuilder();
            for (int i = 0; i < nrml.length(); ++i)
            {
                if (Character.getType(nrml.charAt(i)) != Character.NON_SPACING_MARK)
                {
                    stripped.append(nrml.charAt(i));
                }
            }
            return stripped.toString();
        }
    }


Solution:5

I often use an extension method based on another version I found here (see Replacing characters in C# (ascii)). A quick explanation:

  • Normalizing to form D splits characters like è into an e and a non-spacing `
  • From this, the non-spacing characters are removed
  • The result is normalized back to form C (I'm not sure if this is necessary)

Code:

    using System.Linq;
    using System.Text;
    using System.Globalization;

    // namespace here
    public static class Utility
    {
        public static string RemoveDiacritics(this string str)
        {
            if (null == str) return null;
            var chars =
                from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
                let uc = CharUnicodeInfo.GetUnicodeCategory(c)
                where uc != UnicodeCategory.NonSpacingMark
                select c;

            var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);

            return cleanStr;
        }

        // or, alternatively
        public static string RemoveDiacritics2(this string str)
        {
            if (null == str) return null;
            var chars = str
                .Normalize(NormalizationForm.FormD)
                .ToCharArray()
                .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                .ToArray();

            return new string(chars).Normalize(NormalizationForm.FormC);
        }
    }
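On whether the final normalization back to form C is necessary: it makes no difference for French text, but form D also decomposes characters that carry no accents at all, such as Hangul syllables, whose parts are letters rather than non-spacing marks. A small sketch (the names RecompositionDemo and Strip are hypothetical) that makes the recomposition step optional so the two results can be compared:

```csharp
using System.Globalization;
using System.Text;

public static class RecompositionDemo
{
    // Strips non-spacing marks; the final FormC step is optional for comparison.
    public static string Strip(string text, bool recompose)
    {
        var sb = new StringBuilder();
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        string result = sb.ToString();
        return recompose ? result.Normalize(NormalizationForm.FormC) : result;
    }
}
```

For the Hangul syllable "\uD55C", Strip(text, false) returns three separate jamo characters, while Strip(text, true) returns the original single character. So keeping the FormC step is the safer choice if the input may contain anything beyond Latin text.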


Solution:6

I needed something that converts all major Unicode characters, and the voted answer left a few out, so I've ported CodeIgniter's convert_accented_characters($str) to C# in a version that is easily customisable:

    using System;
    using System.Text;
    using System.Collections.Generic;

    public static class Strings
    {
        static Dictionary<string, string> foreign_characters = new Dictionary<string, string>
        {
            { "äæǽ", "ae" },
            { "öœ", "oe" },
            { "ü", "ue" },
            { "Ä", "Ae" },
            { "Ü", "Ue" },
            { "Ö", "Oe" },
            { "ÀÁÂÃÄÅǺĀĂĄǍΑΆẢẠẦẪẨẬẰẮẴẲẶА", "A" },
            { "àáâãåǻāăąǎªαάảạầấẫẩậằắẵẳặа", "a" },
            { "Б", "B" },
            { "б", "b" },
            { "ÇĆĈĊČ", "C" },
            { "çćĉċč", "c" },
            { "Д", "D" },
            { "д", "d" },
            { "ÐĎĐΔ", "Dj" },
            { "ðďđδ", "dj" },
            { "ÈÉÊËĒĔĖĘĚΕΈẼẺẸỀẾỄỂỆЕЭ", "E" },
            { "èéêëēĕėęěέεẽẻẹềếễểệеэ", "e" },
            { "Ф", "F" },
            { "ф", "f" },
            { "ĜĞĠĢΓГҐ", "G" },
            { "ĝğġģγгґ", "g" },
            { "ĤĦ", "H" },
            { "ĥħ", "h" },
            { "ÌÍÎÏĨĪĬǏĮİΗΉΊΙΪỈỊИЫ", "I" },
            { "ìíîïĩīĭǐįıηήίιϊỉịиыї", "i" },
            { "Ĵ", "J" },
            { "ĵ", "j" },
            { "ĶΚК", "K" },
            { "ķκк", "k" },
            { "ĹĻĽĿŁΛЛ", "L" },
            { "ĺļľŀłλл", "l" },
            { "М", "M" },
            { "м", "m" },
            { "ÑŃŅŇΝН", "N" },
            { "ñńņňŉνн", "n" },
            { "ÒÓÔÕŌŎǑŐƠØǾΟΌΩΏỎỌỒỐỖỔỘỜỚỠỞỢО", "O" },
            { "òóôõōŏǒőơøǿºοόωώỏọồốỗổộờớỡởợо", "o" },
            { "П", "P" },
            { "п", "p" },
            { "ŔŖŘΡР", "R" },
            { "ŕŗřρр", "r" },
            { "ŚŜŞȘŠΣС", "S" },
            { "śŝşșšſσςс", "s" },
            { "ȚŢŤŦτТ", "T" },
            { "țţťŧт", "t" },
            { "ÙÚÛŨŪŬŮŰŲƯǓǕǗǙǛŨỦỤỪỨỮỬỰУ", "U" },
            { "ùúûũūŭůűųưǔǖǘǚǜυύϋủụừứữửựу", "u" },
            { "ÝŸŶΥΎΫỲỸỶỴЙ", "Y" },
            { "ýÿŷỳỹỷỵй", "y" },
            { "В", "V" },
            { "в", "v" },
            { "Ŵ", "W" },
            { "ŵ", "w" },
            { "ŹŻŽΖЗ", "Z" },
            { "źżžζз", "z" },
            { "ÆǼ", "AE" },
            { "ß", "ss" },
            { "Ĳ", "IJ" },
            { "ĳ", "ij" },
            { "Œ", "OE" },
            { "ƒ", "f" },
            { "ξ", "ks" },
            { "π", "p" },
            { "β", "v" },
            { "μ", "m" },
            { "ψ", "ps" },
            { "Ё", "Yo" },
            { "ё", "yo" },
            { "Є", "Ye" },
            { "є", "ye" },
            { "Ї", "Yi" },
            { "Ж", "Zh" },
            { "ж", "zh" },
            { "Х", "Kh" },
            { "х", "kh" },
            { "Ц", "Ts" },
            { "ц", "ts" },
            { "Ч", "Ch" },
            { "ч", "ch" },
            { "Ш", "Sh" },
            { "ш", "sh" },
            { "Щ", "Shch" },
            { "щ", "shch" },
            { "ЪъЬь", "" },
            { "Ю", "Yu" },
            { "ю", "yu" },
            { "Я", "Ya" },
            { "я", "ya" },
        };

        // Char overload: returns only the first character of the replacement.
        public static char RemoveDiacritics(this char c)
        {
            foreach (KeyValuePair<string, string> entry in foreign_characters)
            {
                if (entry.Key.IndexOf(c) != -1)
                {
                    // Some entries map to an empty string; keep the original char in that case.
                    return entry.Value.Length > 0 ? entry.Value[0] : c;
                }
            }
            return c;
        }

        public static string RemoveDiacritics(this string s)
        {
            //StringBuilder sb = new StringBuilder ();
            string text = "";

            foreach (char c in s)
            {
                int len = text.Length;

                foreach (KeyValuePair<string, string> entry in foreign_characters)
                {
                    if (entry.Key.IndexOf(c) != -1)
                    {
                        text += entry.Value;
                        break;
                    }
                }

                // Nothing was appended, so copy the character through unchanged.
                if (len == text.Length)
                {
                    text += c;
                }
            }
            return text;
        }
    }

Usage

    // for strings
    "crème brûlée".RemoveDiacritics(); // creme brulee

    // for chars
    "Ã"[0].RemoveDiacritics(); // A
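For context on what the normalization-based answers leave out: letters such as ø, ł, đ, ß and æ have no canonical decomposition, so there is no combining mark to remove and they pass through unchanged. The sketch below repeats the FormD-and-filter technique under a hypothetical class name to show this.

```csharp
using System.Globalization;
using System.Text;

public static class DecompositionLimits
{
    // FormD, drop non-spacing marks, FormC: the same technique as the normalization answers.
    public static string StripMarks(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this, DecompositionLimits.StripMarks("Øresund Łódź Straße") returns "Øresund Łodz Straße": the ó and ź lose their accents, but Ø, Ł and ß are untouched. A lookup table like the one above is one way to cover those cases.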


Solution:7

The Greek (ISO) code page can do it.

Information about this code page is available from System.Text.Encoding.GetEncodings(). Learn more at: https://msdn.microsoft.com/pt-br/library/system.text.encodinginfo.getencoding(v=vs.110).aspx

Greek (ISO) has codepage 28597 and name iso-8859-7.

On to the code... \o/

    string text = "Você está numa situação lamentável";

    string textEncode = System.Web.HttpUtility.UrlEncode(text, Encoding.GetEncoding("iso-8859-7"));
    //result: "Voce+esta+numa+situacao+lamentavel"

    string textDecode = System.Web.HttpUtility.UrlDecode(textEncode);
    //result: "Voce esta numa situacao lamentavel"

So, write this function...

    public string RemoveAcentuation(string text)
    {
        return
            System.Web.HttpUtility.UrlDecode(
                System.Web.HttpUtility.UrlEncode(
                    text, Encoding.GetEncoding("iso-8859-7")));
    }

Note that Encoding.GetEncoding("iso-8859-7") is equivalent to Encoding.GetEncoding(28597): the first uses the name of the encoding, the second its code page number.


Solution:8

This works fine in Java.

It basically converts all accented characters into their deAccented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.

    import java.text.Normalizer;
    import java.util.regex.Pattern;

    public String deAccent(String str) {
        String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
        Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
        return pattern.matcher(nfdNormalizedString).replaceAll("");
    }
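Since the question is about .NET, here is a rough C# counterpart of the same normalize-then-regex idea. It is a sketch, not a port of any particular library; the names RegexDeAccent and DeAccent are made up. It matches the Unicode category Mn (non-spacing marks), which is broader than the Java block above; .NET's named block \p{IsCombiningDiacriticalMarks} would be the closer equivalent if you only want U+0300 to U+036F.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class RegexDeAccent
{
    // One or more consecutive non-spacing marks.
    private static readonly Regex NonSpacingMarks = new Regex(@"\p{Mn}+", RegexOptions.Compiled);

    public static string DeAccent(string text)
    {
        if (text == null) return null;
        // Decompose first so the accents become separate combining characters.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        return NonSpacingMarks.Replace(decomposed, string.Empty);
    }
}
```

RegexDeAccent.DeAccent("crème brûlée") returns "creme brulee".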


Solution:9

This is the VB version (works with Greek):

    Imports System.Text
    Imports System.Globalization

    Public Function RemoveDiacritics(ByVal s As String) As String
        Dim normalizedString As String
        Dim stringBuilder As New StringBuilder
        normalizedString = s.Normalize(NormalizationForm.FormD)
        Dim i As Integer
        Dim c As Char
        For i = 0 To normalizedString.Length - 1
            c = normalizedString(i)
            If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
                stringBuilder.Append(c)
            End If
        Next
        Return stringBuilder.ToString()
    End Function


Solution:10

This is how I replace diacritic characters with non-diacritic ones in all my .NET programs.

C#:

    // Replaces accented letters with their unaccented base letters
    // (for example 'é' becomes 'e') and lower-cases the result.
    public string RemoveDiacritics(string s)
    {
        string normalizedString = null;
        StringBuilder stringBuilder = new StringBuilder();
        normalizedString = s.Normalize(NormalizationForm.FormD);
        int i = 0;
        char c = '\0';

        for (i = 0; i <= normalizedString.Length - 1; i++)
        {
            c = normalizedString[i];
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                stringBuilder.Append(c);
            }
        }

        return stringBuilder.ToString().ToLower();
    }

VB .NET:

    ' Replaces accented letters with their unaccented base letters
    ' (for example "é" becomes "e") and lower-cases the result.
    Public Function RemoveDiacritics(ByVal s As String) As String
        Dim normalizedString As String
        Dim stringBuilder As New StringBuilder
        normalizedString = s.Normalize(NormalizationForm.FormD)
        Dim i As Integer
        Dim c As Char

        For i = 0 To normalizedString.Length - 1
            c = normalizedString(i)
            If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
                stringBuilder.Append(c)
            End If
        Next
        Return stringBuilder.ToString().ToLower()
    End Function


Solution:11

You can use the string extension from the MMLib.Extensions NuGet package:

    using MMLib.RapidPrototyping.Generators;

    public void ExtensionsExample()
    {
        string target = "aácčeéií";
        Assert.AreEqual("aacceeii", target.RemoveDiacritics());
    }

NuGet page: https://www.nuget.org/packages/MMLib.Extensions/
CodePlex project site: https://mmlib.codeplex.com/


Solution:12

It's funny that such a question can get so many answers, and yet none fit my requirements :) There are so many languages around that a fully language-agnostic solution is, AFAIK, not really possible, and others have mentioned that FormC and FormD cause issues.

Since the original question was related to French, the simplest working answer is indeed

    public static string ConvertWesternEuropeanToASCII(this string str)
    {
        return Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(str));
    }

1251 should be replaced by the code page of the input language.

However, this only replaces one character with one character. Since I am also working with German input, I wrote a manual conversion:

    public static string LatinizeGermanCharacters(this string str)
    {
        StringBuilder sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            switch (c)
            {
                case 'ä':
                    sb.Append("ae");
                    break;
                case 'ö':
                    sb.Append("oe");
                    break;
                case 'ü':
                    sb.Append("ue");
                    break;
                case 'Ä':
                    sb.Append("Ae");
                    break;
                case 'Ö':
                    sb.Append("Oe");
                    break;
                case 'Ü':
                    sb.Append("Ue");
                    break;
                case 'ß':
                    sb.Append("ss");
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }

It might not deliver the best performance, but at least it is very easy to read and extend. Regex is a NO GO, much slower than any char/string stuff.

I also have a very simple method to remove space:

    public static string RemoveSpace(this string str)
    {
        return str.Replace(" ", string.Empty);
    }

In the end, I use a combination of all three extensions above:

    public static string LatinizeAndConvertToASCII(this string str, bool keepSpace = false)
    {
        str = str.LatinizeGermanCharacters().ConvertWesternEuropeanToASCII();
        return keepSpace ? str : str.RemoveSpace();
    }

And a small (not exhaustive) unit test for it, which passes:

    [TestMethod()]
    public void LatinizeAndConvertToASCIITest()
    {
        string europeanStr = "Bonjour ça va? C'est l'été! Ich möchte ä Ä á à â ê é è ë Ë É ï Ï î í ì ó ò ô ö Ö Ü ü ù ú û Û ý Ý ç Ç ñ Ñ";
        string expected = "Bonjourcava?C'estl'ete!IchmoechteaeAeaaaeeeeEEiIiiiooooeOeUeueuuuUyYcCnN";
        string actual = europeanStr.LatinizeAndConvertToASCII();
        Assert.AreEqual(expected, actual);
    }


Solution:13

Try the HelperSharp package.

There is a method RemoveAccents:

    public static string RemoveAccents(this string source)
    {
        // 8 bit characters
        byte[] b = Encoding.GetEncoding(1251).GetBytes(source);

        // 7 bit characters
        string t = Encoding.ASCII.GetString(b);

        // Note: as written, this pattern only matches a non-alphanumeric
        // character followed by the literal text "=-_/".
        Regex re = new Regex("[^a-zA-Z0-9]=-_/");
        string c = re.Replace(t, " ");
        return c;
    }


Solution:14

What this person said:

    Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(text));

In effect, it treats the likes of å, which is a single character (character code 00E5, not 0061 plus the modifier 030A, which would look the same), as an a plus some kind of modifier, and the conversion then drops the modifier, leaving only the a.
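The difference between the single character å and the two-character sequence that looks the same can be checked directly. A small sketch (the class and member names are hypothetical):

```csharp
using System.Text;

public static class ComposedVersusDecomposed
{
    public const string Precomposed = "\u00E5"; // å as a single code point
    public const string Decomposed = "a\u030A"; // a followed by combining ring above

    // Ordinal comparison: the two strings contain different code units.
    public static bool SameCodeUnits()
    {
        return Precomposed == Decomposed;
    }

    // After decomposing the single character, the strings match.
    public static bool SameAfterFormD()
    {
        return Precomposed.Normalize(NormalizationForm.FormD) == Decomposed;
    }

    // After composing the sequence, the strings match as well.
    public static bool SameAfterFormC()
    {
        return Decomposed.Normalize(NormalizationForm.FormC) == Precomposed;
    }
}
```

SameCodeUnits() is false while the other two are true, which is why the normalization-based answers call Normalize before filtering: it puts both spellings into the same form first.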


Solution:15

Popping this library here in case you haven't already considered it. It looks like it comes with a full range of unit tests.

https://github.com/thomasgalliker/Diacritics.NET


Solution:16

    Imports System.Linq
    Imports System.Text
    Imports System.Globalization

    Public Function DECODE(ByVal x As String) As String
        Dim sb As New StringBuilder
        ' Decompose, then keep every character that is not a non-spacing mark.
        For Each c As Char In x.Normalize(NormalizationForm.FormD).Where(Function(a) CharUnicodeInfo.GetUnicodeCategory(a) <> UnicodeCategory.NonSpacingMark)
            sb.Append(c)
        Next
        Return sb.ToString()
    End Function


Solution:17

I really like the concise and functional code provided by azrafe7. So, I have changed it a little bit to convert it to an extension method:

    public static class StringExtensions
    {
        public static string RemoveDiacritics(this string text)
        {
            const string SINGLEBYTE_LATIN_ASCII_ENCODING = "ISO-8859-8";

            if (string.IsNullOrEmpty(text))
            {
                return string.Empty;
            }

            return Encoding.ASCII.GetString(
                Encoding.GetEncoding(SINGLEBYTE_LATIN_ASCII_ENCODING).GetBytes(text));
        }
    }
