Wednesday, October 20, 2010

Groovy: Retrieving the Value of Multiple XML Elements

Yesterday, I ran into an interesting case at work where some code was parsing XML using Groovy's XmlSlurper to retrieve the value of an element and treat it as a String. Something along the lines of:

  def xml = "<xml><character>a</character></xml>"
def node = new XmlSlurper().parseText(xml)
String result = node.character
println result

This simply prints a. The code was expected to return only the value of the first element found, but when a second element is added:

  def xml = "<xml><character>a</character><character>b</character></xml>"
def node = new XmlSlurper().parseText(xml)
String result = node.character
println result

The resulting output is ab, which broke a new test case.

One solution to the issue is to grab the first element with:

  String result = node.character[0]

Another interesting point is that the result of node.character[0] is a NodeChild, not a String. Since the type of the result variable is declared, the right side of the assignment was coerced to a String. If that were not the case, and we had:

  def result = node.character[0]
println result.getClass().getName()
println result

The output would be:

  groovy.util.slurpersupport.NodeChild
a

Note: We need to use getClass() because result.class would be treated as a lookup of child elements named class, not as a request for the Class object itself.

Alternatively, we can use the text() method of NodeChild (which also exists for NodeChildren) to ensure we get a String:

  def result = node.character[0].text()
println result.getClass().getName()
println result

Giving us:

  java.lang.String
a

To summarize, putting all the approaches discussed so far to use:

  def xml = "<xml><character>a</character><character>b</character></xml>"
def node = new XmlSlurper().parseText(xml)

println node.character
println node.character.getClass().getName()
println()

println node.character.text()
println node.character.text().getClass().getName()
println()

println node.character[0]
println node.character[0].getClass().getName()
println()

println node.character[0].text()
println node.character[0].text().getClass().getName()
println()

We get:

  ab
groovy.util.slurpersupport.NodeChildren

ab
java.lang.String

a
groovy.util.slurpersupport.NodeChild

a
java.lang.String

While the original gotcha might not be all that hard to resolve, hopefully this gives a little insight to those who might explore things a little bit further.

Tuesday, October 19, 2010

Bash: Learning about Arrays

Last year, I wrote about a little Groovy script to help me validate XML documents against a schema. I ended up with a little wrapper script to save me from having to recall how to run it using Maven. The script takes a schema location and an example instance document as arguments:

  #!/bin/bash
mvn -o exec:java -Dexec.mainClass=Validator -Dexec.args="$1 $2"

Which I can call with:

  $ ./validate.sh order.xsd order.xml

We store example XML documents along with our internal schemas, and sometimes I find myself running the same script multiple times, once for each document in the project, so I thought having a script to loop through the documents in a directory and report the results would also be helpful.

In order to report both the 'good' instances and the 'bad' instances at the end of the script's run, I needed to learn a little about bash arrays.

I found that in bash, there are multiple ways to create an array. You can start by simply assigning a value to a yet unused array:

  #!/bin/bash
good[0]="my.xml"
echo ${good[0]}

Or you can declare an array using:

  #!/bin/bash
declare -a good=("my.xml" "your.xml")
echo ${good[0]}
echo ${good[1]}
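
One detail worth a quick sketch is the difference between the number of elements in an array and the length of a single element, since the two expansions look very similar. The file names below are only placeholders:

```shell
#!/bin/bash
# Placeholder file names, used only for illustration.
good=("my.xml" "your.xml")

count=${#good[@]}       # number of elements in the array
first_len=${#good[0]}   # number of characters in the first element

echo "elements: $count"
echo "length of first element: $first_len"
```

Here ${#good[@]} counts the elements, while ${#good[0]} measures the string stored at index 0.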

In my case, I will be iterating through the files and adding them to the appropriate array, so I won't know the array's contents or size up front. Inside an if statement, I first check whether the array is empty, and if so declare and initialize it. If the array already has elements, I append to it in the else clause, using ${bad[@]} to expand the existing elements and adding the new one after them. Note that array elements are separated by spaces, not commas:

  #!/bin/bash
if [ ${#bad[@]} -eq 0 ]; then
declare -a bad=("$i")
else
bad=("${bad[@]}" "$i")
fi
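
Bash also offers a += operator for arrays, which appends without first checking whether the array has any elements. A minimal sketch, with made-up file names standing in for the loop variable:

```shell
#!/bin/bash
bad=()                      # start with an explicitly empty array

# Hypothetical file names standing in for the files found in a directory.
for i in "bad1.xml" "bad 2.xml"; do
    bad+=("$i")             # append one element; quoting keeps spaces intact
done

bad_count=${#bad[@]}
echo "$bad_count failed: ${bad[*]}"
```

With this form, the if/else around the assignment is no longer needed.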

To roll everything up, my new script takes the schema location and a directory as arguments. For every file listed in the directory, I run the Groovy code against it. If the validation fails, setting $? to 1, I add the file to the 'bad' list. Otherwise, it goes to the 'good' list:

  #!/bin/bash
xsd=$1
dir=$2

for i in `ls "$dir"`
do
echo -n "$i..."
mvn -o exec:java -Dexec.mainClass=Validator -Dexec.args="$xsd $dir/$i"

if [ $? -eq 1 ]; then
if [ ${#bad[@]} -eq 0 ]; then
declare -a bad=("$i")
else
bad=("${bad[@]}" "$i")
fi
else
if [ ${#good[@]} -eq 0 ]; then
declare -a good=("$i")
else
good=("${good[@]}" "$i")
fi
fi
fi
done

echo
echo "Good files: "
echo "${good[@]}"
echo
echo "Failed files: "
echo "${bad[@]}"
echo

At the end of the script, I simply echo both lists. An example of the output for the report:

  Good files: 
good1.xml good2.xml good3.xml

Failed files:
bad1.xml bad2.xml bad3.xml

I'm sure I could remove some of the duplication, but since I have only two arrays and this is just a helper, I think I'll leave things be for now. If you have any more array advice, please leave a comment!
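
For anyone curious what removing the duplication might look like, here is a rough sketch. The validate function is a hypothetical stand-in for the Maven call (it simply fails any file whose name starts with "bad"), and the file list is made up so the example runs on its own:

```shell
#!/bin/bash
# Stand-in for the Maven call; a hypothetical rule that fails any file
# whose name starts with "bad". Swap in the real validator command here.
validate() {
    case "$(basename "$1")" in
        bad*) return 1 ;;
        *)    return 0 ;;
    esac
}

good=()
bad=()

# Hypothetical file list; in the real script this would come from "$dir"/*.
for f in good1.xml bad1.xml good2.xml; do
    if validate "$f"; then
        good+=("$f")
    else
        bad+=("$f")
    fi
done

echo "Good files: ${good[*]}"
echo "Failed files: ${bad[*]}"
```

Initializing both arrays up front and appending with += removes the nested emptiness checks, and testing the command directly in the if statement avoids comparing $? against a specific exit code.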

Friday, October 1, 2010

Trang: Creating Schemas from XML

Does anyone like writing XML schemas? Sometimes they can be frustrating, and yet they always end up feeling simple when you're done. When given the choice, it always feels good to me to start writing a schema from an example instance document, and of course there are plenty of tools to help.

While a lot of tools are available for a price, Trang is free, and it helps me with the writer's block I tend to get when I'm handed an XML document and asked to make a schema from scratch.

Trang is a Java app that can be downloaded in zip form. Luckily (for me at least), it was available in the Ubuntu package repositories:

$ sudo apt-get install trang

Now that we have Trang installed, let's generate a schema from a simple XML document, languages.xml:

<?xml version='1.0' encoding='UTF-8'?>
<languages>
<language>
<name>Groovy</name>
<platform>JVM</platform>
<appeared>2003</appeared>
</language>
<language>
<name>Scala</name>
<platform>JVM</platform>
<appeared>2003</appeared>
</language>
<language>
<name>Boo</name>
<platform>CLR</platform>
<appeared>2003</appeared>
</language>
</languages>

Now let's tell Trang, gratefully, to make us a schema:

$ trang languages.xml languages.xsd

And a schema is generated for us:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xs:element name="languages">
<xs:complexType>
<xs:sequence>
<xs:element maxOccurs="unbounded" ref="language"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="language">
<xs:complexType>
<xs:sequence>
<xs:element ref="name"/>
<xs:element ref="platform"/>
<xs:element ref="appeared"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="name" type="xs:NCName"/>
<xs:element name="platform" type="xs:NCName"/>
<xs:element name="appeared" type="xs:integer"/>
</xs:schema>

This is a really good start, but I feel like I should make a couple of changes. To make things a little easier to understand for consumers, I think I'll change the uses of NCName to string:

<xs:element name="name" type="xs:string"/>
<xs:element name="platform" type="xs:string"/>

And state explicitly that at least one language element is required for the document to be valid (minOccurs already defaults to 1, so this only makes the rule visible):

<xs:element minOccurs="1" maxOccurs="unbounded" ref="language"/>

My edited schema becomes:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
<xs:element name="languages">
<xs:complexType>
<xs:sequence>
<xs:element minOccurs="1" maxOccurs="unbounded" ref="language"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="language">
<xs:complexType>
<xs:sequence>
<xs:element ref="name"/>
<xs:element ref="platform"/>
<xs:element ref="appeared"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="name" type="xs:string"/>
<xs:element name="platform" type="xs:string"/>
<xs:element name="appeared" type="xs:integer"/>
</xs:schema>

Even with this simple example, Trang has saved me a lot of typing. Trang can also create RELAX NG and DTD documents if you need them.