Sunday, January 20, 2008

XML 1.0 versus Web Services

On the cxf-user mailing list, we see a question over and over:

"Why can't I send an escape character to my web service?"

On further examination, the symptoms are always the same. The schema type is xs:string. The data contains a C0 control character, such as Form Feed or Escape. The result is either an error or a character that silently disappears. And the OP is annoyed.

The OP generally gets even more annoyed when we reveal the ugly truth: there's no good solution to this.


Long ago, when the world was young, the W3C created the specification for XML 1.0. They chose Unicode as the fundamental representation of text, so that everyone's favorite poem could be captured in an XML document.

However, they made the mistake of actually reading the Unicode specification in detail (an exercise for the insomniac if ever there was one). And they spotted the presence of the dusty, musty, old-fashioned ASCII control characters.

Having noticed them, they banished nearly all of them from XML in section 2.2 (only tab, line feed, and carriage return were spared). You may think I'm being bombastic with the term 'banished,' but it's the simple truth. An XML document cannot contain any character outside of production [2] of section 2.2. A character represented with an & is still a character. So a character reference such as &#12; (Form Feed) is no more permitted than the same character sitting there, literally, in the document.
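To make production [2] concrete, here is a minimal Java sketch of the check it implies. The class and method names (Xml10Chars, isAllowed, firstForbiddenIndex) are made up for illustration and are not part of CXF or any parser; the ranges are the ones listed for Char in section 2.2 of XML 1.0.

```java
// Hypothetical helper: names are illustrative, not from any toolkit.
final class Xml10Chars {
    private Xml10Chars() {}

    /** True if the code point matches production [2] (Char) of XML 1.0. */
    static boolean isAllowed(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Index of the first char XML 1.0 forbids, or -1 if the whole string is fine. */
    static int firstForbiddenIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);          // handles surrogate pairs
            if (!isAllowed(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // A form feed between two "pages" of text sits at index 8.
        System.out.println(firstForbiddenIndex("page one\fpage two"));
        // Tab and newline are among the few control characters that survive.
        System.out.println(firstForbiddenIndex("tab\tand newline\n are fine"));
    }
}
```

Running a check like this before the data reaches the binding at least turns a mysterious disappearance into a diagnosable error.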

Here comes the nasty part. Consider a simple Web Service. The service has a WSDL, the WSDL has a schema, and the schema specifies a string. An xsd:string. The web service binding will, quite cheerfully, map this to a Java String or a C# string. And now the fuse begins to burn...

Java Strings and C# strings hold any Unicode characters. Not just the ones that are valid in XML. xsd:string values, on the other hand, describe the XML content model, and so cannot contain control characters. By a certain logic, toolkits should refuse to map xsd:string to plain old String data types. They should map them to some class that checks for compliance with XML. You can imagine how popular that would be.

This problem seems to have become more visible of late. Why? Because more XML parsers are paying attention to section 2.2 of the specification. A few years back, all the common Java XML parsers ignored the restriction, and only the Microsoft DOM made a conspicuous point of rejecting invalid characters. Now, mainstream parsers, as used by mainstream web service toolkits, are paying attention. In the case of Woodstox, sadly, the attention being paid consists of discarding the rejects rather than diagnosing them.
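If you want to see what your own parser does, a small experiment along these lines will tell you. This is a sketch using the standard JAXP API; the class name ControlCharParseDemo and the parses helper are invented for the example. With the parser bundled in current JDKs I would expect the form feed cases to be rejected, but a different parser on the classpath may behave differently, which is rather the point.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: reports whether the JAXP parser accepts a document.
final class ControlCharParseDemo {
    private ControlCharParseDemo() {}

    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;                       // not well-formed for this parser
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e); // environment problem, not an XML one
        }
    }

    public static void main(String[] args) {
        System.out.println("reference to form feed: " + parses("<a>&#12;</a>"));
        System.out.println("literal form feed:      " + parses("<a>\f</a>"));
        System.out.println("reference to tab:       " + parses("<a>&#9;</a>"));
    }
}
```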

Application developers are not happy. They want to send document content through their web service, without worrying about the occasional stray form feed.

What is to be done?

Well, there's XML 1.1. It permits these characters (NUL excepted), provided they are written as character references. However, all of the web service specifications demand XML 1.0, and there's no sign on the horizon of any alternative. So there's no help there.

There's base64. Particularly for short strings, xsd:base64Binary is the only practical solution. Sadly, data bindings for web services don't give you much help here. You'd like to @nnotate that you want to have a Java String as the Java datatype, xsd:base64Binary as the schema datatype, and let the generated code take care of everything else. No such luck. The best you can do is call mystring.getBytes("utf-8") yourself, pass the resulting byte[] into your service, and reverse the process on the other end.
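The round trip looks roughly like this. It is a sketch, not generated binding code: the Base64Text class is hypothetical, and java.util.Base64 assumes Java 8 or later. When the schema type is xsd:base64Binary, the binding normally does the base64 step for you and you only hand it the byte[]; the explicit encoding is shown here to make clear what ends up in the document.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper showing both halves of the String <-> base64 round trip.
final class Base64Text {
    private Base64Text() {}

    /** Sender side: text to UTF-8 bytes to base64 lexical form. */
    static String encode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    /** Receiver side: base64 lexical form to UTF-8 bytes to text. */
    static String decode(String base64) {
        byte[] utf8 = Base64.getDecoder().decode(base64);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "page one\fpage two";   // contains a form feed
        String wire = encode(original);           // only XML-safe characters
        System.out.println(wire);
        System.out.println(original.equals(decode(wire)));
    }
}
```

Naming the charset explicitly on both sides matters: relying on the platform default is how the form feed survives while the accented characters do not.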

Be careful with that UTF-8, by the way. In JavaScript, in particular, there are many base64 packages floating around that assume that the data will be ASCII.

If you have a lot of data, it's time to contemplate attachments.

I just finished teaching CXF's JavaScript client generator to handle MTOM attachments for this purpose. Half the battle was the sloppy documentation on MTOM on the web. Beware of Metro's documentation here. It has bugs in the example of the wire traffic and bugs in the schema for xmime:ContentType. Other than that it's quite informative.

By the time I was done, a Java side DataHandler with content type of 'text/plain;charset=utf-8' was mapped, bidirectionally, to a JavaScript variable in the browser, with an MTOM attachment in between.

Anyone who needs to ship arbitrary textual content through a web service has to think about this. If your application can scoop up a form feed, there is an angel with a flaming sword standing between you and the convenience of xsd:string. You have been warned.

1 comment:

Tatu said...

Just thought I'd send you a note about something related: as of Woodstox 4.0 (to be released Any Day Now), there is a new property that allows replacing such control characters when writing XML:

(since most of the time such characters are not really needed in there but just snuck in somehow)

I actually got inspired to add this feature based on this blog entry...
so consider this a belated "thank you" note. :-)

Woodstox 4.0 also has top-notch support for reading and writing base64 encoded binary data as well, but that's another story.