Preprocessing XML Documents

An XML document must be converted into an adjacency-list representation before Quegel can process it. We provide a SAX parsing program for this purpose, whose space cost is linear in the height of the DOM tree.
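The space bound follows from how SAX parsing works: the handler only needs to remember the elements that are currently open. The following is a minimal sketch of that idea, not the provided parser. The class name SaxAdjacencySketch, the integer vertex IDs, and the "id, parent id, tag" line format are assumptions made for illustration; the actual "index.txt" format may differ.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: assigns an ID to each element in document order and
// emits one "id <TAB> parentId <TAB> tag" line per element.
class SaxAdjacencySketch extends DefaultHandler {
    // IDs of the currently open elements; its size never exceeds the tree height.
    private final Deque<Integer> open = new ArrayDeque<>();
    // Stand-in for an output file; a real program would stream lines to disk.
    private final StringBuilder out = new StringBuilder();
    private int nextId = 0;
    private int maxDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        int id = nextId++;
        int parent = open.isEmpty() ? -1 : open.peek(); // -1 marks the root (assumed convention)
        out.append(id).append('\t').append(parent).append('\t').append(qName).append('\n');
        open.push(id);
        maxDepth = Math.max(maxDepth, open.size());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop(); // the element is closed, so its ID is no longer needed
    }

    static SaxAdjacencySketch parse(String xml) throws Exception {
        SaxAdjacencySketch handler = new SaxAdjacencySketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    String edges() { return out.toString(); }

    int maxDepth() { return maxDepth; }

    public static void main(String[] args) throws Exception {
        SaxAdjacencySketch h = parse("<a><b><c/></b><d/></a>");
        System.out.print(h.edges());
        System.out.println("max open elements: " + h.maxDepth());
    }
}
```

Emitting each element together with its parent ID, rather than collecting child lists in memory, is what keeps the working state limited to the stack of open elements.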

In the program, set "inputfile" to the path of the input XML document; if the document references a schema file such as a DTD, that file must also be provided. The program writes the adjacency-list data to "index.txt", and writes to "output.txt" a compact version of the XML document with all whitespace characters removed. To extract an XML element (subtree), call getXML(.) over "output.txt" with the element's start and end positions in "output.txt"; these positions are recorded in "index.txt".
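Position-based extraction of this kind can be sketched as a seek-and-read over the compact file. The code below is an illustration only and is not the getXML(.) implementation: the name extractSubtree is hypothetical, and it assumes that positions are byte offsets into a UTF-8 file, with the start inclusive and the end exclusive. Check how positions are actually recorded in "index.txt" before relying on these conventions.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SubtreeExtractSketch {
    // Hypothetical helper: returns the bytes in [start, end) of the file as a string.
    // Assumes byte offsets and UTF-8 encoding; both are assumptions, not documented facts.
    static String extractSubtree(Path compactFile, long start, long end) throws IOException {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid range: " + start + ".." + end);
        }
        try (RandomAccessFile raf = new RandomAccessFile(compactFile.toFile(), "r")) {
            if (end > raf.length()) {
                throw new IllegalArgumentException("end position is past the end of the file");
            }
            byte[] buf = new byte[Math.toIntExact(end - start)];
            raf.seek(start);    // jump directly to the element's first byte
            raf.readFully(buf); // read exactly the element's bytes
            return new String(buf, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Small stand-in for "output.txt", written to a temporary file.
        Path tmp = Files.createTempFile("output", ".txt");
        Files.write(tmp, "<a><b>x</b><c/></a>".getBytes(StandardCharsets.UTF_8));
        System.out.println(extractSubtree(tmp, 3, 11));
        Files.delete(tmp);
    }
}
```

Seeking to the start position avoids re-parsing the document: only the bytes of the requested subtree are read.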

As an example, you may download the uwm.xml dataset from the UW XML Data Repository and run the Java parser over it. The resulting adjacency-list file can then be uploaded to HDFS for use by Quegel; our application code assumes that the XML data is stored under the HDFS path "/uwm".