public class Aalto
{
public static byte[] getBytesFromFile(File file) throws IOException {
InputStream is = new FileInputStream(file);
long length = file.length();
byte[] bytes = new byte[(int)length];
int offset = 0;
int numRead = 0;
while (offset < bytes.length
&& (numRead=is.read(bytes, offset, bytes.length-offset)) >= 0) {
offset += numRead;
}
if (offset < bytes.length) {
throw new IOException("Could not completely read file " + file.getName());
}
is.close();
return bytes;
}
public static void main (String args[]) throws Exception
{
int iterations = 2000;
Average for hamlet.xml: 147.22 MB/sec
Average for soap_mid.xml: 43.80 MB/sec
As noted on the website, Aalto does seem to be quite fast on the “fast path”. Impressive for a Java solution at this point.
Update: 2008-03-03 13:15 PST: Thanks to Paul Findlay for catching my misspelling of aalto.jar in the java run command. The numbers posted above are actually for the default Java 6 StAX parser, and not Aalto. Re-running, I get:
Average for hamlet.xml: 147.85 MB/sec
Average for soap_mid.xml: 85.95 MB/sec
Much more impressive numbers from the Java camp. Graphs will be updated later today.
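The mix-up in the update above (a misspelled jar name silently falling back to the JDK's bundled StAX parser) is easy to catch by printing which implementation the factory lookup actually resolved. A minimal sketch, assuming only the standard `javax.xml.stream` API; the class name `WhichStax` is made up for illustration and is not part of the benchmark code.

```java
import javax.xml.stream.XMLInputFactory;

class WhichStax {
    // Returns the class name of the StAX factory that the standard lookup resolved.
    static String factoryImplementation() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // With a third-party parser jar correctly on the classpath, this should name
        // that library's factory; otherwise it names the JDK's bundled one.
        System.out.println("StAX factory in use: " + factoryImplementation());
    }
}
```

Running this with the same classpath as the benchmark, before timing anything, would show right away whether the intended parser was picked up.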
public class Javolution
{
public static byte[] getBytesFromFile(File file) throws IOException {
InputStream is = new FileInputStream(file);
long length = file.length();
byte[] bytes = new byte[(int)length];
int offset = 0;
int numRead = 0;
while (offset < bytes.length
&& (numRead=is.read(bytes, offset, bytes.length-offset)) >= 0) {
offset += numRead;
}
if (offset < bytes.length) {
throw new IOException("Could not completely read file " + file.getName());
}
is.close();
return bytes;
}
public static void main (String args[]) throws Exception
{
int iterations = 2000;
public class Woodstox
{
public static byte[] getBytesFromFile(File file) throws IOException {
InputStream is = new FileInputStream(file);
long length = file.length();
byte[] bytes = new byte[(int)length];
int offset = 0;
int numRead = 0;
while (offset < bytes.length
&& (numRead=is.read(bytes, offset, bytes.length-offset)) >= 0) {
offset += numRead;
}
if (offset < bytes.length) {
throw new IOException("Could not completely read file " + file.getName());
}
is.close();
return bytes;
}
public static void main (String args[]) throws Exception
{
int iterations = 2000;
I added Java DOM to the graphs. Building a tree in memory is not the fastest way to parse a doc, but it is the easiest way to modify the doc after parsing. Java 6 DOM doesn’t do too badly on parsing speed, but with all the allocation going on, RAM usage skyrockets, and the efficiency graph shows the pain.
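For readers who have not used it, here is roughly what "build a tree, then modify it" looks like with the JDK's DOM API. This is a small sketch and not the benchmark code; the class name `DomSketch` and the sample document are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

class DomSketch {
    // Parses the bytes into an in-memory tree. Every element, attribute and text
    // run becomes an allocated node, which is where the RAM goes.
    static Document parse(byte[] xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xml));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<play><title>Hamlet</title></play>".getBytes(StandardCharsets.UTF_8);
        Document doc = parse(xml);

        // Once the tree exists, editing it is straightforward.
        Element root = doc.getDocumentElement();
        Element act = doc.createElement("act");
        act.setTextContent("I");
        root.appendChild(act);

        System.out.println("children of <play>: " + root.getChildNodes().getLength());
    }
}
```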
This goes to show you how good library design and the D Programming Language come together to kick serious butt.
Speed master Kris made some changes to Tango’s XML libraries today, and increased the performance of the parser to over 500MB/second! The machine is still the quad core 2.66GHz Intel box running Linux with 4GB of RAM. This run reflects revision 3286 of Tango SVN.
I will only update the images here; I think you should know by now how I obtained them…
While SAX is showing slower in speed than DOM in Tango (I hope that is as weird to read as it was for me to write), you can see that the RAM usage graph puts it back into perspective.
I also forgot to note that this quad core box is now capable of parsing XML at over 2GB/sec if all 4 cores are used. Impressive indeed.
I decided to post a graph of speed versus resource usage as an interesting view into the overhead of the various programs. Since all benchmarks maxed out the CPU at 100% and all cached the data to be parsed (so the disk wasn’t being used), that leaves RAM as the measurement of resource usage. The following is a chart of the parsing speed divided by the memory usage. Of note were xmlpull and xmlsax, using 688KB of memory, so their numbers actually increased, showing not only the speed, but the conservation of resources. The RAM numbers were taken from top while the program was running, and represent the “Resident Set” so as not to make Java look horribly bad.
Update: 2008-02-24 15:45 PST - I updated the graph to offset the RAM usage by subtracting the file size from the total RAM, so that as the files get larger, they won’t be penalized. To put it into other words, the closer you can keep RAM usage to the filesize, decreasing overhead, the more resource efficient your parser is. I bet you are thinking Tango was designed that way from the beginning right about now, aren’t you?
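The efficiency number described above boils down to one division. Here is a minimal Java sketch of it, assuming speed in MB/sec and memory overhead converted to MB (the post does not state the divisor's unit, so that part is a guess). The class and method names are made up for illustration, and the 100 MB/sec speed in `main` is a placeholder, not a measured result.

```java
class Efficiency {
    // Speed in MB/sec; sizes in KB. Overhead is resident memory beyond the file itself.
    // ASSUMPTION: the divisor is expressed in MB; the original graph's unit is not stated.
    static double score(double mbPerSec, double residentKb, double fileKb) {
        double overheadMb = (residentKb - fileKb) / 1024.0;
        if (overheadMb <= 0) {
            throw new IllegalArgumentException("resident memory must exceed file size");
        }
        return mbPerSec / overheadMb;
    }

    public static void main(String[] args) {
        // 688KB resident and a 274KB file come from the post; 100.0 is a placeholder speed.
        System.out.println(score(100.0, 688.0, 274.0));
    }
}
```

With less than 1 MB of overhead the divisor is below one, which is why a lean parser's score ends up higher than its raw speed.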
Average parsing speed: 79.02 and 39.83 MB/sec, respectively. Note that I did remove the DTD declaration from hamlet.xml for this benchmark, since it was erroring out trying to find play.dtd.
Output from java -version:
stonecobra@jeff-home:~/xmlbench$ java -version
java version "1.6.0_03"
Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
Java HotSpot(TM) Server VM (build 1.6.0_03-b05, mixed mode)
Many thanks to Nietsnie, who was kind enough to write up a libxml2 SAX benchmark and run it on his quad core 2.66GHz box running Linux. I have updated the other benchmarks to reflect using his machine as well, to keep everything on the same playing field. test.c is the benchmark code used, listed here:
Here is the current summary of the benchmarks run so far in a graphical form:
I hope to add more (libxml2, Xerces-C, etc) in the future. If you have C++ chops, I am looking for someone to code up one for MSXML. I will also be adding some Java benchmarks in here as well.
Update 2008-02-23 20:57 PST - Since Nietsnie was kind enough to donate his machine time, I re-ran all the current benchmarks on his box, to be able to include the libxml2 sax numbers as apples to apples. The graph is now updated, and includes the speed (Megabytes per second). Thanks to Robert Fraser for catching that.
The current benchmarking machine is an Ubuntu box with 4GB RAM sporting a quad-core Intel chip at 2.66GHz. In other words, much faster than my machine.
I hesitate to publish these numbers, as they are not a direct apples to apples comparison. The reason is that the D Programming Language version 2.0’s std.xml is an XML parser, but one where you must know the schema beforehand, and register handlers for each element by name. I was unwilling/too lazy to write said handlers for the docs I was doing, so I found a method called check() that, according to the source code comments, makes sure that a document is well-formed and contains no bad characters. That’s as close as I am going to get to parsing these docs without code help from the community, so take this with a grain of salt or two. I am using DMD 2.011, using stdxml.d to benchmark, listed here:
Average for hamlet.xml: 6.51 MB/sec.
Average for soap_mid.xml: 4.39 MB/sec.
PS: I also wanted to note for any naysayers, that I left off -O -release and -inline because the phobos example actually runs SLOWER with any and/or all of these flags. I am not trying to slip anything by anyone here.
Next is Tango’s SaxParser, a SAX API layered on top of PullParser for the D Programming Language. It passes parsing events through to a handler, push-style. I used the current SVN HEAD of Tango, which is current revision 3247, and compiled with DMD v1.024. I count the number of elements, attributes, and text nodes, along with their lengths, to attempt to compare to the benchmarks here. Apparently, Tango is beating them masterfully. soap_mid.xml is the same file (by size, and I suspect, origin) as their “soap2.xml”. And they have an extra 200MHz of CPU in their benchmark. The benchmark code used was xmlsax.d, listed here:
void main()
{
auto content = import ("hamlet.xml");
auto parser = new SaxParser!(char);
auto handler = new LengthHandler!(char);
parser.setSaxHandler(handler);
parser.setContent(content);
for (int i = 11; --i;)
benchmark (2000, parser, content);
}
private class LengthHandler(Ch = char) : SaxHandler!(Ch) {
public uint elm;
public uint att;
public uint txt;
public uint elmlen;
public uint attlen;
public uint txtlen;
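The D listing above is cut off, but its idea is a handler that counts elements, attributes, and text nodes along with their lengths. For Java readers, a comparable counting handler on the JDK's SAX API might look like the sketch below. It is an analog and not a translation of xmlsax.d: what each "length" measures (element name, attribute value, text run) is an assumption here, and the class name `LengthCounter` is invented.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class LengthCounter extends DefaultHandler {
    int elm, att, txt;
    long elmlen, attlen, txtlen;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        elm++;
        elmlen += qName.length();                 // ASSUMPTION: length of the element name
        for (int i = 0; i < attrs.getLength(); i++) {
            att++;
            attlen += attrs.getValue(i).length(); // ASSUMPTION: length of the attribute value
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so txt counts chunks.
        txt++;
        txtlen += length;
    }

    // Pushes the whole document through a fresh handler and returns it.
    static LengthCounter count(byte[] xml) {
        LengthCounter handler = new LengthCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        byte[] xml = "<a x=\"12\"><b>hey</b></a>".getBytes(StandardCharsets.UTF_8);
        LengthCounter c = count(xml);
        System.out.println(c.elm + " elements, " + c.att + " attributes, "
                + c.txtlen + " text chars");
    }
}
```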
Next is Tango’s Document, a DOM-ish parser built on top of PullParser for the D Programming Language. It builds an in-memory tree of the document being parsed, which can then be easily navigated/edited in-memory. I used the current SVN HEAD of Tango, which is current revision 3247, and compiled with DMD v1.024. The benchmark code used was xmldom.d, listed here:
Average of the runs was 118.19 MB/sec parsing. Looks like a similar result to PullParser. Attributes must have a fairly high cost in this implementation.
Update 2008-02-23 19:57 PST
Running on a quad core 2.66GHz box yielded:
First up, Tango’s tango.text.xml.PullParser. You instantiate the parser, start the parse, and then continue to ask for the next ‘node’. I used the current SVN HEAD of Tango, which at the time of writing was revision 3247, compiled with DMD v1.024. The benchmark code run was xmlpull.d, and is listed here:
Average of the resulting run: 229.06. Lower than hamlet.xml, probably due to the attribute processing required, but also possibly the lack of whitespace.
Update 2008-02-23 19:57 PST
Running on a quad core 2.66GHz box yielded:
Average for hamlet.xml: 476.77 MB/sec.
Average for soap_mid.xml: 339.15 MB/sec. Now we are talking some speed!!! This D Programming Language has some merit.
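The pull style described for PullParser (instantiate, start the parse, keep asking for the next node) has a close Java counterpart in StAX's `XMLStreamReader`. The sketch below shows that loop using only the standard `javax.xml.stream` API. It is not Tango's API and not the xmlpull.d benchmark, and the class name `PullSketch` is made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PullSketch {
    // Returns {start elements seen, character events seen}.
    static int[] count(byte[] xml) {
        int elements = 0;
        int texts = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xml));
            // The caller drives the parse: nothing happens until next() is called.
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        elements++;
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        texts++;
                        break;
                    default:
                        break; // end tags, comments, etc. are ignored in this sketch
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return new int[] {elements, texts};
    }

    public static void main(String[] args) {
        int[] c = count("<a><b>x</b><c/></a>".getBytes(StandardCharsets.UTF_8));
        System.out.println(c[0] + " elements, " + c[1] + " text events");
    }
}
```

Because the caller decides when to advance, a pull loop can stop early or skip subtrees without the parser doing work for events nobody asked for.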
In wanting to see how well the Tango XML parsers fare in the world, I have started this benchmarking post. I will post all of my results, as well as the code and files that achieve these results here, so this post will be living as I expand and update it.
First off, baseline equipment. I have a Thinkpad T60p with a 2.0GHz Intel T2500 CPU, 2GB RAM, and a fairly slow hard drive. All of my tests will cache the document to be parsed in memory to try and eliminate the hard drive as a potential bottleneck.
Next up, the files. I will be starting with hamlet.xml and soap_mid.xml. hamlet.xml weighs in at 274KB, and contains no attributes at all, very element heavy, with a moderate amount of whitespace (enough to make the file readable). soap_mid.xml weighs in at 132KB, uses namespaces, and looks like it was barfed onto the street (not so human readable).
Now, the benchmark. I will be writing and posting the benchmarking code, but the gist is this: load the file into memory to eliminate the hard drive as a bottleneck, then execute 10 iterations, each parsing the document enough times to constitute at least 100MB of data. I intend to use the fastest possible configuration of the parser, not the safest, and will keep the code open to allow suggested improvements from the community.
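As a rough illustration of that gist, the sketch below works out how many parses are needed to cover 100MB for a given document size, times a number of runs, and averages the throughput. It is a hypothetical harness, not the posted benchmark code (the listings in this post hard-code an iteration count of 2000 instead); `BenchSketch`, `parsesPerRun` and `averageMbPerSec` are invented names, and the parser is passed in as a plain callback.

```java
import java.util.function.Consumer;

class BenchSketch {
    static final long TARGET_BYTES = 100L * 1024 * 1024; // "at least 100MB" per run

    // Number of parses needed so that one run covers at least TARGET_BYTES.
    static int parsesPerRun(int documentBytes) {
        if (documentBytes <= 0) {
            throw new IllegalArgumentException("document must not be empty");
        }
        return (int) ((TARGET_BYTES + documentBytes - 1) / documentBytes); // ceiling division
    }

    // Times `runs` runs over an in-memory document and returns the average MB/sec.
    static double averageMbPerSec(byte[] doc, Consumer<byte[]> parser, int runs) {
        int parses = parsesPerRun(doc.length);
        double megabytes = (double) parses * doc.length / (1024.0 * 1024.0);
        double total = 0;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            for (int i = 0; i < parses; i++) {
                parser.accept(doc);
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            total += megabytes / seconds;
        }
        return total / runs;
    }

    public static void main(String[] args) {
        byte[] doc = new byte[274 * 1024]; // stand-in for a 274KB file already read into memory
        // The callback here does no parsing at all; plug a real parser call in its place.
        double avg = averageMbPerSec(doc, bytes -> { }, 10);
        System.out.println(parsesPerRun(doc.length) + " parses per run, " + avg + " MB/sec");
    }
}
```

A real run would also want a few warm-up passes before timing on the JVM, since the JIT compiler changes the numbers considerably in the first seconds.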
Tango has landed XML support in the tango.text.xml package. Current highlights include a pull parser, a DOM parser, and a SAX parser, as well as a budding XPath-like package.
What makes these different you ask? Why another damn XML parser? Glad you asked. These components are intended to be high-speed, non-allocating tools that can be used at a server or appliance level with much less overhead than other solutions. For example, the SAX parser needs just a few KB of memory over and above the size of the content being parsed.
If you need a fast XML parser, check them out. I am still writing up my benchmarking output, so stay tuned for a post on that shortly.