Mar 2008

XML Benchmarks - Parse/Query/Mutate/Serialize

(8:41 am) Tags: [Software, Projects, D Programming Language]

I created a benchmark similar to the one that VTD-XML uses. Since most XML processing is mutation, this benchmark parses an input XML file, executes various XPath queries against it, modifies the document in two places, and then serializes the new document. The steps are listed below:

  1. Parse blog.xml, preparing to query the resulting document
  2. Perform the following XPath queries, or their equivalents, once each:
    • count(//*) (10390 for this document)
    • //item (a list of those 10390 items)
    • /blog/item (similar to the previous, except you know the path)
    • //text() (all text nodes)
    • count(//item)
    • count(/blog/item)
    • /blog/item[@num='a781']
    • /blog/item/body/p/a
  3. Mutate the document by removing the resulting nodes from the last 2 queries (performed inline with the queries)
  4. Serialize the modified document back out
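
As a rough illustration of the steps above, here is a minimal sketch of one cycle in Python's standard xml.etree.ElementTree. It is not one of the benchmarked implementations. ElementTree only supports a subset of XPath, so count() and //text() are replaced with equivalents, and the function name run_cycle is made up for this example. One caveat: ElementTree stores the text that follows an element on the element itself, so removing an a element here also drops its trailing text, which a faithful port would need to reattach.

```python
import xml.etree.ElementTree as ET


def run_cycle(xml_text):
    """Run one parse/query/mutate/serialize cycle (hypothetical helper)."""
    # Step 1: parse from scratch.
    root = ET.fromstring(xml_text)

    # Step 2: the queries, or their closest ElementTree equivalents.
    stats = {
        "all_elements": sum(1 for _ in root.iter()),         # count(//*)
        "items_anywhere": len(list(root.iter("item"))),      # //item, count(//item)
        "text_nodes": len(list(root.itertext())),            # //text()
    }
    # /blog/item and count(/blog/item): direct children of the root.
    blog_items = root.findall("item") if root.tag == "blog" else []
    stats["items_under_blog"] = len(blog_items)

    # Step 3: mutate inline with the last two queries.
    # /blog/item[@num='a781']
    for item in root.findall("item[@num='a781']"):
        root.remove(item)
    # /blog/item/body/p/a (ElementTree has no parent pointers, so walk
    # the parents and remove the matching children from each one).
    removed_links = 0
    for p in root.findall("item/body/p"):
        for a in p.findall("a"):
            p.remove(a)
            removed_links += 1
    stats["removed_links"] = removed_links

    # Step 4: serialize the modified document.
    return stats, ET.tostring(root, encoding="unicode")
```

Calling run_cycle on the contents of blog.xml performs one cycle; a benchmark would repeat that call and time it.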

I created this benchmark for 4 products (the ones that have XPath or XPath-like support; if you know of another one, please send me some code, and I will be happy to run it and aggregate the results):

After the run, I take the average cycle time and turn that into the following graph showing cycles per second. blog.xml is 1.3 MB, so you can multiply these numbers by 1.3 to get the megabytes-per-second figure for each tool.
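
The conversion is simple enough to sketch. The snippet below is an assumption about how such a measurement could be done, not the harness actually used for the graph: it averages the wall-clock time of repeated cycles, inverts that to get cycles per second, and scales by the 1.3 MB document size. The function names and the runs parameter are made up for this example.

```python
import time

DOC_SIZE_MB = 1.3  # size of blog.xml as stated in the post


def cycles_per_second(cycle, runs=100):
    """Average the wall-clock time of `runs` calls to `cycle`, then invert it."""
    start = time.perf_counter()
    for _ in range(runs):
        cycle()
    average_cycle_time = (time.perf_counter() - start) / runs
    return 1.0 / average_cycle_time


def megabytes_per_second(cps, doc_size_mb=DOC_SIZE_MB):
    """Convert cycles per second into megabytes processed per second."""
    return cps * doc_size_mb
```

For instance, a tool that manages 30.3 cycles per second on this document works out to roughly 39.4 MB per second.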

Some notes on the implementations:

I would also note that these benchmarks were run on an Intel Q6700 quad-core machine at 2.66 GHz, with 4 GB of RAM, running Ubuntu Linux.

3 Responses to “XML Benchmarks - Parse/Query/Mutate/Serialize”

  1. Jimmy Zhang Says:

    Interesting results, but a few questions:
    1. Did you run the test using the server JVM?
    2. Have you considered precompiling the XPath expressions, instead of compiling them again and again in the loop?
    3. What do you mean by delete in a delete? It doesn’t make sense to me…

    If you do all those things, I will be shocked if VTD underperforms Tango D, because Tango D needs to repeatedly serialize and parse, while VTD-XML is incremental…

    Jimmy Zhang

  2. Scott Sanders Says:

    1. I did use the server VM, yes.
    2. Can you post a diff against my VTD example? I know you know the VTD API better, so just show me what to change, and I will change it, no problem.
    3. If you look at the example, look at the commented-out remove call. Uncomment and run that, and you will get an exception.

    As for your last comment, I specifically force a parse each time, rather than an index load, because I am trying to compare a real-world XML appliance sort of scenario, where you see many different documents once. On the serialize front, Tango is actually as fast serializing from scratch as it is keeping a cache of the input to spit back out.

  3. Scott Sanders Says:

    I updated the VTD test to do the first delete, output the result, re-parse, and then do the final deletes, and throughput drops to 30.3 operations per second.

    It is worth noting that since VTD-XML indexes everything by position, an edit that adds/removes data requires a re-parse, because all the indexes are now invalid.

    Even with that overhead, VTD is head and shoulders above DOM and DOM4J.
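
    To illustrate the point about position-based indexes, here is a toy sketch in Python. It does not use VTD-XML or mirror its record format; the (offset, length) index and the helper names are assumptions made up for this example. It only shows why offsets recorded before a byte-level edit stop lining up afterwards.

```python
import re


def index_tags(buf):
    """Record (offset, length) of every start-tag name in the byte buffer."""
    return [(m.start(1), len(m.group(1)))
            for m in re.finditer(rb"<([A-Za-z][\w.-]*)", buf)]


def names(buf, index):
    """Read tag names back out of the buffer using the recorded positions."""
    return [buf[off:off + n].decode() for off, n in index]


doc = b"<blog><item><a>x</a></item><item/></blog>"
index = index_tags(doc)

# Remove the <a>x</a> element by splicing bytes out of the buffer.
start = doc.index(b"<a>")
end = doc.index(b"</a>") + len(b"</a>")
edited = doc[:start] + doc[end:]

# Offsets past the edit now point at the wrong bytes...
stale = names(edited, index)
# ...so the edited buffer has to be indexed again.
fresh = names(edited, index_tags(edited))
```

    Here stale no longer matches the real tag names in the edited buffer, while fresh does, which is the same reason the test above has to re-parse between deletes.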