Tutorial: SAX invalid XML character exception



Question:

I have downloaded the XML dump of the Stack Overflow site. While transferring the dump into a MySQL database I keep running into the following error: Got an Exception: Character reference "&#x10" is an invalid XML character. (The reference varies; &#x10 is one example.)

I used UltraEdit (it is an 800 MB file) to remove some characters from the file, but whenever I remove one invalid character reference and run the parser again, I get an error identifying more invalid characters. Any suggestions on how to solve this?

Cheers all,

j


Solution:1

Which dump are you using? There were problems with the first version (not just invalid characters, but also < appearing where it shouldn't), but they should have been fixed in the second dump.

For what it's worth, I fixed the invalid characters in the original using two regex replaces. Replace "&#x0[12345678BCEF];" and "" each with "?" - treating them both as regular expressions, of course.
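The two-pass regex replacement described above could be sketched in Java roughly as follows. The first pattern is taken verbatim from the answer. The second pattern did not survive in the text, so `SECOND_ASSUMED` is only a guess that it covered the remaining control range `&#x10;` to `&#x1F;`. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class DumpRefScrubber {
    // Pattern quoted in the answer: references to 0x01-0x08, 0x0B, 0x0C, 0x0E and 0x0F.
    static final Pattern FIRST = Pattern.compile("&#x0[12345678BCEF];");

    // ASSUMPTION: the second pattern is missing from the text; this guess covers 0x10-0x1F.
    static final Pattern SECOND_ASSUMED = Pattern.compile("&#x1[0-9A-F];");

    // Replaces every matching character reference in one line with "?".
    static String scrub(String line) {
        String firstPass = FIRST.matcher(line).replaceAll("?");
        return SECOND_ASSUMED.matcher(firstPass).replaceAll("?");
    }

    public static void main(String[] args) {
        // &#x0A; (line feed) is a legal reference and is left alone.
        System.out.println(scrub("a&#x10;b&#x0B;c&#x0A;d"));
    }
}
```

For an 800 MB dump, `scrub` would be applied line by line while streaming the file rather than loading it into memory at once.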


Solution:2

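The set of characters permitted in XML is defined by the Char production of the XML 1.0 specification: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. As you can see, #x10 is not one of them. If such characters are present in the Stack Overflow dump, then the dump is not XML compliant.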
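As a rough illustration, the XML 1.0 character ranges can be turned into a small Java check. The class `XmlChars` and its method names are made up for this sketch; only the ranges come from the specification.

```java
final class XmlChars {
    // True if the code point matches the XML 1.0 Char production.
    static boolean isValid(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first invalid character in s, or -1 if all are valid.
    static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isValid(cp)) {
                return i;
            }
            i += Character.charCount(cp); // step over both halves of a surrogate pair
        }
        return -1;
    }
}
```

Note that this checks literal characters only. A character reference such as `&#x10;` is plain ASCII text until the parser expands it, so it has to be handled separately.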

Alternatively, you're reading the XML using the wrong character encoding.
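To rule out an encoding mismatch, the encoding can be stated explicitly when handing the input to the SAX parser. The sketch below assumes the standard JAXP parser bundled with the JDK; the class `ExplicitEncodingParse` and the element-counting handler are hypothetical and only there to make the example self-contained.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class ExplicitEncodingParse {
    // Parses the bytes with the given encoding and returns the number of elements seen.
    static int countElements(byte[] xml, String encoding) {
        InputSource source = new InputSource(new ByteArrayInputStream(xml));
        source.setEncoding(encoding); // tell the parser how to decode the byte stream
        final int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    count[0]++;
                }
            });
        } catch (Exception e) {
            // Wrapped so that callers do not need to declare checked exceptions.
            throw new RuntimeException(e);
        }
        return count[0];
    }
}
```

For a real dump, the `ByteArrayInputStream` would be replaced by a `FileInputStream` on the dump file.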


Solution:3

You should convert your file to UTF-8. I develop in Java; below is my conversion:

import java.io.*;
import java.nio.charset.StandardCharsets;

// Copies xmlfile to "<name>.utf8" without control characters and returns the new path.
// Only literal control characters are removed; references such as &#x10; are left untouched.
public String FileUTF8Cleaner(File xmlfile) {
    String out = xmlfile + ".utf8";
    File outFile = new File(out);
    if (outFile.exists()) {
        System.out.println("### File conversion process ### Deleting utf8 file");
        outFile.delete();
        System.out.println("### File conversion process ### Deleting utf8 file [DONE!]");
    }

    System.out.println("### File conversion process ### Converting file");
    // The input is decoded with the platform default charset, as in the original code;
    // pass an explicit charset to InputStreamReader if the source encoding is known.
    // try-with-resources closes both streams even when an exception is thrown.
    try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(xmlfile)));
         Writer writer = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(out), StandardCharsets.UTF_8))) {
        String strLine;
        while ((strLine = br.readLine()) != null) {
            // \p{Cc} matches every control character, so tabs inside a line are removed too
            writer.write(strLine.replaceAll("\\p{Cc}", ""));
            writer.write("\n");
        }
        System.out.println("### File conversion process ### Converting file [DONE!]");
    } catch (IOException e) {
        e.printStackTrace();
    }

    System.out.println("### File conversion process ### Processing file : " + xmlfile.getAbsolutePath() + " [DONE!]");
    return out;
}
