Tutorial: Guessing the encoding of text represented as byte[] in Java


Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most likely encoding used (in Java)?

Worth noting:

  • No additional meta-data is available. The byte array is literally the only available input.
  • The detection algorithm will obviously not be 100% correct. If the algorithm is correct in more than, say, 80% of cases, that is good enough.


The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

public static String guessEncoding(byte[] bytes) {
    String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}

The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.

I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.


There is also Apache Tika - a content analysis toolkit. It can guess the mime type, and it can guess the encoding. Usually the guess is correct with a very high probability.


Here's my favorite: https://github.com/codehaus/guessencoding

It works like this:

  • If there's a UTF-8 or UTF-16 BOM, return that encoding.
  • If none of the bytes have the high-order bit set, return ASCII (or you can force it to return a default 8-bit encoding instead).
  • If there are bytes with the high bit set but they're arranged in the correct patterns for UTF-8, return UTF-8.
  • Otherwise, return the platform default encoding (e.g., windows-1252 on an English-locale Windows system).

It may sound overly simplistic, but in my day-to-day work it's well over 90% accurate.
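The four rules above can be sketched in plain Java without any third-party library. This is a hypothetical re-implementation of the described heuristic, not code from the linked repository; the class and method names are my own, and the UTF-8 validity check is simplified (it accepts a few overlong forms that a strict validator would reject).

```java
import java.nio.charset.Charset;

public class EncodingHeuristic {
    // Sketch of the BOM / ASCII / UTF-8-pattern / platform-default heuristic.
    public static Charset guess(byte[] b) {
        // 1. Byte-order marks.
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
            return Charset.forName("UTF-8");
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return Charset.forName("UTF-16BE");
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return Charset.forName("UTF-16LE");

        // 2. No byte has the high-order bit set: plain ASCII.
        boolean highBit = false;
        for (byte x : b) { if ((x & 0x80) != 0) { highBit = true; break; } }
        if (!highBit) return Charset.forName("US-ASCII");

        // 3. High-bit bytes form valid UTF-8 multi-byte sequences: UTF-8.
        if (looksLikeUtf8(b)) return Charset.forName("UTF-8");

        // 4. Otherwise fall back to the platform default (e.g. windows-1252).
        return Charset.defaultCharset();
    }

    // Simplified UTF-8 structure check: every lead byte must be followed by
    // the right number of 10xxxxxx continuation bytes.
    static boolean looksLikeUtf8(byte[] b) {
        int i = 0;
        while (i < b.length) {
            int c = b[i] & 0xFF;
            int trailing;
            if (c < 0x80) trailing = 0;                     // ASCII
            else if (c >= 0xC2 && c <= 0xDF) trailing = 1;  // 2-byte sequence
            else if (c >= 0xE0 && c <= 0xEF) trailing = 2;  // 3-byte sequence
            else if (c >= 0xF0 && c <= 0xF4) trailing = 3;  // 4-byte sequence
            else return false;                              // invalid lead byte
            for (int j = 1; j <= trailing; j++) {
                if (i + j >= b.length || (b[i + j] & 0xC0) != 0x80) return false;
            }
            i += trailing + 1;
        }
        return true;
    }
}
```

For example, `guess("héllo".getBytes("UTF-8"))` returns UTF-8 (the bytes C3 A9 for é match the two-byte pattern), while a lone Latin-1 é (byte 0xE9) fails the pattern check and falls through to the platform default.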


Chi's answer seems most promising for real use. I just want to add that, according to Joel Spolsky, Internet Explorer used a frequency-based guessing algorithm in its day:


Roughly speaking, all the assumed-to-be-text is copied and parsed in every encoding imaginable. Whichever parse best fits a language's average word (and letter?) frequency profile wins. I cannot quickly tell whether jchardet uses the same kind of approach, so I thought I'd mention this just in case.
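A toy version of that frequency idea can be written with the standard library alone: decode the bytes with each candidate charset and keep the one whose output contains the largest share of plausible text characters. This is only an illustrative sketch of the principle (the class name, candidate list, and scoring are my own inventions), not the actual algorithm IE or jchardet uses.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.util.List;

public class FrequencyGuess {
    // Decode with each candidate and score the result by the fraction of
    // letters, digits, whitespace and common punctuation it contains.
    public static Charset pick(byte[] bytes, List<Charset> candidates) {
        Charset best = null;
        double bestScore = -1;
        for (Charset cs : candidates) {
            String decoded;
            try {
                // A fresh decoder throws on malformed input by default.
                decoded = cs.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
            } catch (CharacterCodingException e) {
                continue; // the bytes are not even valid in this charset
            }
            long plausible = decoded.chars()
                .filter(c -> Character.isLetterOrDigit(c)
                          || Character.isWhitespace(c)
                          || ".,;:!?'\"()-".indexOf(c) >= 0)
                .count();
            double score = decoded.isEmpty() ? 0 : (double) plausible / decoded.length();
            if (score > bestScore) { bestScore = score; best = cs; }
        }
        return best;
    }
}
```

For UTF-8 input such as "héllo wörld", a mis-decode as ISO-8859-1 produces mojibake like "hÃ©llo wÃ¶rld" containing non-letter symbols (©, ¶), so the UTF-8 candidate scores higher. A real implementation would use per-language letter and word frequency tables instead of this crude "plausible character" ratio.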


Check out jchardet


There should be libraries already available for this.

A Google search turned up ICU4J.




Without an encoding indicator, you will never know for sure. However, you can make some intelligent guesses. See my answer to this question:

How to determine if a String contains invalid encoded characters

Use the validUTF8() method. If it returns true, treat the input as UTF-8; otherwise, treat it as Latin-1 (ISO-8859-1).
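The linked answer's validUTF8() is not reproduced here, but a minimal stand-in can be built from the standard library: a strict UTF-8 decode either succeeds or throws. This sketch assumes that "valid UTF-8 implies UTF-8" is an acceptable heuristic for your data, as the answer suggests.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // True if the bytes form well-formed UTF-8: a decoder configured to
    // REPORT errors throws on the first malformed sequence.
    public static boolean validUTF8(byte[] input) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // The two-way decision described above: UTF-8 if valid, else Latin-1.
    public static String guessEncoding(byte[] input) {
        return validUTF8(input) ? "UTF-8" : "ISO-8859-1";
    }
}
```

Note the asymmetry: any byte sequence is decodable as ISO-8859-1, so Latin-1 is only ever a fallback, never a positive detection.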
