Tutorial: Are XHTML entity encodings valid in XML documents as long as they're contained inside CDATA tags?



Question:

Is this a valid (well-formed) XML document?

<?xml version="1.0" encoding="UTF-8" ?>   <outer>    <inner>&copy;</inner>  </outer>  

At issue is whether the HTML/XHTML "&copy;" entity encoding is valid in an XML document where there is no DTD or schema to define it. An alternative way of expressing the above would be to say this:

<?xml version="1.0" encoding="UTF-8" ?>   <outer>    <inner>&#169;</inner>  </outer>  

Which would seem to be valid XML with a UTF-8 encoding.

But is this valid:

<?xml version="1.0" encoding="UTF-8" ?>   <outer>    <inner><![CDATA[&copy;]]></inner>  </outer>  

The author of the above intends to indicate to the XML parser that it should pass through the copyright symbol above as the string "&copy;" rather than as a proper Unicode character.

In that respect I find this quote a little confusing: 'New authors of XML documents often misunderstand the purpose of a CDATA section, mistakenly believing that its purpose is to "protect" data from being treated as ordinary character data during processing. [But] Character data is character data, regardless of whether it is expressed via a CDATA section or ordinary markup.' (From Wikipedia)

I am separately looking at a proposed XML format from a second author who has wrapped the content of every tag in CDATA sections, even when the tag can, for example, only contain digits.

Hope an XML guru can help clear up the confusion on the purpose of CDATA.

Thanks!


Solution:1

A CDATA section lets you include literal text that would otherwise be interpreted as markup in an XML document: something that looks like an entity reference, or something that looks like XML tags. Anything in a CDATA section can also be written in valid XML without one; you just need to use entity references to escape the special characters, so that they are treated as character data in the element's content rather than as XML markup.

So yes, the following is perfectly valid, as long as it is what you intend:

<?xml version="1.0" encoding="UTF-8" ?>   <outer>    <inner><![CDATA[&copy;]]></inner>  </outer>  

Here, the value of the inner element is the literal string &copy;, which will not be interpreted by the XML parser as the entity reference for the copyright symbol. You can also do the following:

<?xml version="1.0" encoding="UTF-8" ?>   <outer>    <inner><![CDATA[<normally> this looks <like/> &amp; xml </normally>]]></inner>  </outer>  

where the value for the inner element is

<normally> this looks <like/> &amp; xml </normally>  

To do this without a CDATA section:

<?xml version="1.0" encoding="UTF-8" ?>   <outer>    <inner>&lt;normally&gt; this looks &lt;like/&gt; &amp;amp; xml &lt;/normally&gt;</inner>  </outer>  

which is much less human-readable, but equivalent as far as an XML parser is concerned. If you did this (assuming that the inner element is defined in a schema or DTD as containing a string and not XML), a validating XML parser will complain:

<?xml version="1.0" encoding="UTF-8" ?>   <outer>    <inner><normally> this looks <like/> &amp; xml </normally></inner>  </outer>  

So you use CDATA or entity escaping to protect the special characters from the XML parser, so that the client of the XML data gets the value of inner, which happens to contain XML markup characters.

Note: To be clear, the above example is well-formed XML, but if the schema or DTD says that the element inner contains xsd:string or equivalent, then it is an invalid XML document.
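The equivalence between the CDATA form and the entity-escaped form can be checked by parsing both and comparing the text of inner. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the choice of Python and the helper name inner_text are illustrative and not part of the original answer.

```python
import xml.etree.ElementTree as ET

PROLOG = '<?xml version="1.0" encoding="UTF-8" ?>'

# Same content expressed with a CDATA section...
cdata_doc = (
    PROLOG
    + "<outer><inner><![CDATA[<normally> this looks <like/> &amp; xml </normally>]]></inner></outer>"
)

# ...and with entity references instead of CDATA.
escaped_doc = (
    PROLOG
    + "<outer><inner>&lt;normally&gt; this looks &lt;like/&gt; &amp;amp; xml &lt;/normally&gt;</inner></outer>"
)


def inner_text(doc):
    """Hypothetical helper: parse the document and return the text of <inner>."""
    return ET.fromstring(doc).find("inner").text


cdata_text = inner_text(cdata_doc)
escaped_text = inner_text(escaped_doc)

print(cdata_text)
print(cdata_text == escaped_text)  # True: the parser reports the same character data
```

Once parsed, the application cannot tell which of the two forms was used in the source document.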

And no, HTML or XHTML entities that are not predefined by XML itself (only &lt;, &gt;, &amp;, &apos; and &quot; are) cannot be used unless they are declared, for example in a DTD. Without a declaration, your XML parser will return an error.
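To see the error for an undeclared entity in practice, here is a small sketch that assumes Python's standard xml.etree.ElementTree parser; other parsers report the problem in their own way, and the exact wording of the message is not shown here because it depends on the parser.

```python
import xml.etree.ElementTree as ET

# &copy; is used, but no DTD declares it.
doc = '<?xml version="1.0" encoding="UTF-8" ?><outer><inner>&copy;</inner></outer>'

try:
    ET.fromstring(doc)
    parse_error = None
except ET.ParseError as exc:
    # The parser rejects the document because "copy" is not a declared entity.
    parse_error = exc

print(parse_error)
```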


Solution:2

Eddie gave a good reply; I will just add a few points that he apparently did not mention.

<?xml version="1.0" encoding="UTF-8" ?>   <outer>    <inner>&copy;</inner>  </outer>  

is not legal (the entity "copy" is not predefined; in XML, only "lt", "gt", "amp", "apos" and "quot" are).

<?xml version="1.0" encoding="UTF-8" ?>   <outer>    <inner>&#169;</inner>  </outer>  

is perfectly legal and probably gives what you want (a copyright symbol).

<?xml version="1.0" encoding="UTF-8" ?>   <outer>    <inner><![CDATA[&copy;]]></inner>  </outer>  

is also perfectly legal but yields a quite different result (the element <inner> will contain six Unicode characters, instead of one in the previous example).

<?xml version="1.0" encoding="UTF-8" ?>   <!DOCTYPE outer[  <!ENTITY copy "&#169;">  ]>  <outer>    <inner>&copy;</inner>  </outer>  

is legal, too, and gives the same result as the second example. Declaring the entity can save you from typing characters that you use but that are not easy to generate with your keyboard/editor.

<?xml version="1.0" encoding="UTF-8" ?>   <outer>    <inner>©</inner>  </outer>  

is legal, too (because of encoding="UTF-8"; with encoding="US-ASCII" it would have been impossible), and gives the same result, provided that your keyboard/editor allows you to enter this character directly.
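The legal variants above can be compared side by side. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the dictionary keys are labels invented for this illustration, and "\u00a9" stands for the literal © character typed directly into the document.

```python
import xml.etree.ElementTree as ET

PROLOG = '<?xml version="1.0" encoding="UTF-8" ?>'

docs = {
    "numeric reference": PROLOG + "<outer><inner>&#169;</inner></outer>",
    "CDATA section": PROLOG + "<outer><inner><![CDATA[&copy;]]></inner></outer>",
    "declared entity": (
        PROLOG
        + '<!DOCTYPE outer [<!ENTITY copy "&#169;">]>'
        + "<outer><inner>&copy;</inner></outer>"
    ),
    "literal character": PROLOG + "<outer><inner>\u00a9</inner></outer>",
}

# Parse each document and collect the text content of <inner>.
results = {name: ET.fromstring(doc).find("inner").text for name, doc in docs.items()}

for name, text in results.items():
    print(f"{name}: {len(text)} character(s)")
```

Only the CDATA variant yields the six-character string &copy;; the other three yield the single copyright character.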


Solution:3

The contents of a CDATA block are not interpreted as markup by the XML parser, so with regard to well-formedness and parseability you can put almost anything you like inside CDATA (the one exception is the sequence ]]>, which ends the section).

Of course, that also means the contents are passed through as plain character data, so if you want an actual © in your XML, this will not work. I am assuming you plan to load the contents of the CDATA into an X/HTML parser, just as you might load a blob of base64-encoded binary data from an image into an image parser. An XML parser makes no attempt to derive meaning from the contents of a CDATA block; it might as well say "foo" as &copy;.

The Wikipedia quote does seem to be confusingly worded.
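As an illustration of the two-stage processing assumed above, the sketch below reads the CDATA content with an XML parser and then hands it to a separate HTML-aware step. It assumes Python, with the standard library function html.unescape standing in for "an X/HTML parser"; a real consumer might use a full HTML parser instead.

```python
import html
import xml.etree.ElementTree as ET

doc = (
    '<?xml version="1.0" encoding="UTF-8" ?>'
    "<outer><inner><![CDATA[&copy;]]></inner></outer>"
)

# Stage 1: the XML parser hands back the CDATA content untouched.
raw = ET.fromstring(doc).find("inner").text
print(raw)  # &copy;

# Stage 2: an HTML-aware step resolves the HTML entity.
decoded = html.unescape(raw)
print(decoded == "\u00a9")  # True: now a single copyright character
```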

