Improving YUI 3 DataSchema-XML

I have recently been working on finalizing the Amazon Web Services Utility, which now supports all the non-cart operations and almost all the possible response groups. In the course of developing the JSON version of the utility, I had the opportunity to work closely with YUI 3's DataSchema-XML, and noticed several shortcomings: it doesn't support nested schemas, the resultListLocator cannot be an XPath statement, and all lookups fail if the Amazon XML namespace is included. This article looks at how I implemented the first two improvements. I have not been able to resolve the namespace issue, which may be because Amazon is not providing a valid XML DTD.

Supporting Nested Schemas

I was surprised that this feature was missing. XML documents frequently contain multiple nested lists of elements, and although one could iterate over the results of the first schema and apply additional schemas to each result, this is sub-optimal. It is more efficient and simpler to let the DataSchema API handle nested schemas instead. Here is an example XML document where a nested schema would be necessary:

Example 1: Nested Schema XML

<Items>
	<Item>
		<Id>1</Id>
		<Name>Foo</Name>
		<Attributes>
			<Attribute Name="Color">Red</Attribute>
			<Attribute Name="Height">10px</Attribute>
			<Attribute Name="Width">10px</Attribute>
		</Attributes>
	</Item>
	<Item>
		<Id>1</Id>
		<Name>Bar</Name>
		<Attributes>
			<Attribute Name="Color">Blue</Attribute>
			<Attribute Name="Height">10px</Attribute>
			<Attribute Name="Width">10px</Attribute>
		</Attributes>
	</Item>
	…
</Items>

In this case the parent schema will locate the Item tags and then a nested schema will locate the Attribute tags. Prior to today's improvements, the code to implement this would be:

Example 2: Legacy Nested Schema Support

// Parent schema: one result per Item tag
var schemaItem = {
	resultListLocator: "Item",
	resultFields:[
		{key:"id", locator:"Id"},
		{key:"name", locator:"Name"}
	]
},
// Child schema: one result per Attribute tag
schemaAttribute = {
	resultListLocator: "Attribute",
	resultFields:[
		{key:"name", locator:"@Name"},
		{key:"value", locator:"."}
	]
},
results = Y.DataSchema.XML.apply(schemaItem, xmldoc);

// Apply the child schema to each parent result by hand
for (var i = 0, j = results.length; i < j; i += 1) {
	results[i].attributes = Y.DataSchema.XML.apply(schemaAttribute, results[i]);
}

With the refactor, simply replace the locator property with schema to add a nested schema, which simplifies Example 2 to:

Example 3: New Nested Schema Support

var schemaItem = {
	resultListLocator: "Item",
	resultFields:[
		{key:"id", locator:"Id"},
		{key:"name", locator:"Name"},
		// A schema property in place of locator marks a nested schema
		{key:"attributes", schema: {
			resultListLocator: "Attribute",
			resultFields:[
				{key:"name", locator:"@Name"},
				{key:"value", locator:"."}
			]
		}}
	]
},
results = Y.DataSchema.XML.apply(schemaItem, xmldoc);

Besides supporting nested schemas, I have also made an improvement related to error handling. By default the DataSchema API will return a large error stack if the resultListLocator returns an empty set. Most of the time this is probably desired; however, there are times (especially with nested schemas) when an empty list is a valid response. When an empty list is allowed, simply add the property allowEmpty: true to your schema definition and no error will be returned.
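
To make the idea concrete, here is a minimal sketch of how a nested schema and allowEmpty could be handled recursively. It is not the code from dataschema-xml2.js: the el, findAll, locate and applySchema helpers are hypothetical, and plain objects stand in for DOM nodes so the snippet runs without a browser or YUI.

```javascript
// Hypothetical stand-in for a DOM element: { tag, attrs, text, children }.
function el(tag, attrs, content) {
	var isText = typeof content === "string";
	return {
		tag: tag,
		attrs: attrs || {},
		text: isText ? content : "",
		children: isText ? [] : (content || [])
	};
}

// Collect every descendant with the given tag name, in document order.
function findAll(node, tag, found) {
	found = found || [];
	node.children.forEach(function (child) {
		if (child.tag === tag) { found.push(child); }
		findAll(child, tag, found);
	});
	return found;
}

// Resolve a simple locator: "." (own text), "@Name" (attribute) or a tag name.
function locate(node, locator) {
	if (locator === ".") { return node.text; }
	if (locator.charAt(0) === "@") {
		var value = node.attrs[locator.slice(1)];
		return value === undefined ? null : value;
	}
	var match = findAll(node, locator)[0];
	return match ? match.text : null;
}

// Apply a schema to a node. A field with a "schema" property recurses,
// using the current result node as the new search root.
function applySchema(schema, node) {
	var out = { results: [] },
		nodes = findAll(node, schema.resultListLocator);

	// An empty list is an error unless the schema opts in with allowEmpty.
	if (!nodes.length && !schema.allowEmpty) {
		out.error = new Error("Empty result list: " + schema.resultListLocator);
		return out;
	}
	nodes.forEach(function (item) {
		var record = {};
		schema.resultFields.forEach(function (field) {
			// Only the nested results are kept; nested errors are dropped in this sketch.
			record[field.key] = field.schema ?
				applySchema(field.schema, item).results :
				locate(item, field.locator);
		});
		out.results.push(record);
	});
	return out;
}

// Sample data shaped like Example 1; the second Item has no Attribute tags.
var doc = el("Items", null, [
	el("Item", null, [
		el("Id", null, "1"),
		el("Name", null, "Foo"),
		el("Attributes", null, [
			el("Attribute", { Name: "Color" }, "Red"),
			el("Attribute", { Name: "Height" }, "10px")
		])
	]),
	el("Item", null, [
		el("Id", null, "2"),
		el("Name", null, "Bar"),
		el("Attributes", null, [])
	])
]);

var parsed = applySchema({
	resultListLocator: "Item",
	resultFields: [
		{ key: "id", locator: "Id" },
		{ key: "name", locator: "Name" },
		{ key: "attributes", schema: {
			resultListLocator: "Attribute",
			allowEmpty: true,
			resultFields: [
				{ key: "name", locator: "@Name" },
				{ key: "value", locator: "." }
			]
		}}
	]
}, doc);

// The same empty lookup, without and with allowEmpty.
var strict = applySchema({ resultListLocator: "Missing", resultFields: [] }, doc);
var lenient = applySchema({ resultListLocator: "Missing", resultFields: [], allowEmpty: true }, doc);
```

In this sketch, parsed.results holds one record per Item, each with its own attributes array, strict carries an error property, and lenient simply returns an empty results list.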

Supporting XPath ResultListLocators

All the infrastructure for executing XPath lookups already existed in the API, but it was not being leveraged by the code that executes the resultListLocator. This feature is easy to overlook and was most likely left out of version 3.0.0 because no one realized it was required for certain XML formats. Take for example the following XML:

Example 4: XML Requiring XPath

<Request>
	<Item>DVD</Item>
	<ResponseGroup>Test</ResponseGroup>
	…
</Request>
<Items>
	<Item>
		<Id>1</Id>
		<Name>Foo</Name>
	</Item>
	<Item>
		<Id>1</Id>
		<Name>Bar</Name>
	</Item>
	…
</Items>

The schema you might try to use to fetch all the Items/Item is:

Example 5: Non-XPath Schema

var schemaItem = {
	resultListLocator: "Item",
	resultFields:[
		{key:"id", locator:"Id"},
		{key:"name", locator:"Name"}
	]
},
results = Y.DataSchema.XML.apply(schemaItem, xmldoc);

However, since getElementsByTagName is used to fetch the Item tags, this schema will return 3 Item tags instead of 2, matching the child of Request as well as the two children of Items. In order to limit the response to only those tags inside of Items, we need to support XPath lookups:

Example 6: XPath Enabled Schema

var schemaItem = {
	resultListLocator: "Items/Item",
	resultFields:[
		{key:"id", locator:"Id"},
		{key:"name", locator:"Name"}
	]
},
results = Y.DataSchema.XML.apply(schemaItem, xmldoc);
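
The difference between the two lookups can be shown without a browser. The sketch below is only an illustration: byTagName and byPath are hypothetical helpers that imitate getElementsByTagName and a simple child path over plain objects, and the Response root element is an assumption added so the Example 4 fragment has a single parent.

```javascript
// Hypothetical stand-in for a DOM element: a tag name plus child nodes.
function el(tag, children) {
	return { tag: tag, children: children || [] };
}

// Imitates getElementsByTagName: matches descendants at any depth.
function byTagName(node, tag, found) {
	found = found || [];
	node.children.forEach(function (child) {
		if (child.tag === tag) { found.push(child); }
		byTagName(child, tag, found);
	});
	return found;
}

// Imitates a relative path such as "Items/Item": each step matches direct children only.
function byPath(node, path) {
	return path.split("/").reduce(function (nodes, step) {
		var next = [];
		nodes.forEach(function (n) {
			n.children.forEach(function (child) {
				if (child.tag === step) { next.push(child); }
			});
		});
		return next;
	}, [node]);
}

// Same shape as Example 4, wrapped in an assumed root element.
var root = el("Response", [
	el("Request", [el("Item"), el("ResponseGroup")]),
	el("Items", [
		el("Item", [el("Id"), el("Name")]),
		el("Item", [el("Id"), el("Name")])
	])
]);

var loose = byTagName(root, "Item").length;     // 3: includes Request/Item
var scoped = byPath(root, "Items/Item").length; // 2: only children of Items
```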

The new DataSchema-XML API will automatically evaluate if the resultListLocator is an XPath lookup or not. By default the evaluator will use getElementsByTagName, which should be faster.
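
One way such a check could work is sketched below. This is an assumption about the approach, not the actual test used in dataschema-xml2.js: looksLikeXPath and chooseLookup are hypothetical names, and the character list is a simple heuristic that treats anything other than a bare tag name as XPath.

```javascript
// Hypothetical heuristic: a locator containing a path separator, attribute
// marker, predicate, wildcard or function call is treated as XPath.
function looksLikeXPath(locator) {
	return /[\/@\[\(\*]/.test(locator);
}

// Pick the lookup strategy, defaulting to the cheaper tag-name search.
function chooseLookup(locator) {
	return looksLikeXPath(locator) ? "xpath" : "getElementsByTagName";
}

var plain = chooseLookup("Item");        // "getElementsByTagName"
var nested = chooseLookup("Items/Item"); // "xpath"
```

Keeping getElementsByTagName as the default means existing schemas with plain tag names behave as before.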

Namespace Causes Parsing Failure

I was not able to determine if this was an issue with the API or Amazon. I am not an expert on XPath, but as far as I can tell the NSResolver in DataSchema-XML is set up correctly, so the XPath lookups should work if the provided namespace is valid. This leads me to believe that the issue with the XML namespace for AWS lies with Amazon and not the API.
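
One thing that may be worth ruling out: in XPath 1.0 an unprefixed name in an expression only matches elements that have no namespace, so a document that declares a default namespace will not match a path like Items/Item even when the resolver is correct. The sketch below shows the general shape of a prefix-based workaround. It is an untested idea rather than a confirmed fix; the aws prefix, the placeholder URI and the prefixPath helper are all hypothetical, and it only handles simple slash-separated paths.

```javascript
// Hypothetical prefix-to-URI map. The URI is a placeholder, not Amazon's real namespace.
var namespaces = { aws: "http://example.com/placeholder-namespace" };

// A resolver in the function form that document.evaluate accepts:
// given a prefix, return its namespace URI, or null if unknown.
function nsResolver(prefix) {
	return namespaces.hasOwnProperty(prefix) ? namespaces[prefix] : null;
}

// Add a prefix to every plain element step of a simple path.
// Attribute steps (@Name), "." and empty steps (from "//") are left alone.
function prefixPath(path, prefix) {
	return path.split("/").map(function (step) {
		var isElementName = /^[A-Za-z_]/.test(step) && step.indexOf(":") === -1;
		return isElementName ? prefix + ":" + step : step;
	}).join("/");
}

var locator = prefixPath("Items/Item", "aws"); // "aws:Items/aws:Item"
```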

Please leave a comment if you have a working XML namespace in any documents you parsed using DataSchema-XML or if you have experienced the same issue as I.

Conclusion

The new dataschema-xml2.js has many improvements, but most importantly it supports nested schemas and XPath lookups for resultListLocator. You can see it in action on the AWS Test Page. I am in the process of getting these changes approved to go into the GitHub trunk build, but in the meantime you will need to load "dataschema-xml2.js" as you would any other non-YUI module.