Preprocessing RDF Data

RDF datasets such as those described on Page 16 of our technical report can be used as input. We assume the data is in N-Quads or N-Triples format, where each line is one triplet or quadruplet statement. The header file triple.h defines the data structure of a statement and how to parse a line into a statement; it should be included in all later programs.
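To illustrate what parsing a line into a statement involves, here is a minimal sketch for the N-Triples case only. It is not the content of triple.h: the names Statement and parseNTriple are hypothetical, and the parser assumes a well-formed line of the form "<s> <p> <o> ." with no trailing comment.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical statement type; the real definition lives in triple.h.
struct Statement {
    std::string subject;
    std::string predicate;
    std::string object;
};

// Parse one N-Triples line "<s> <p> <o> ." into `out`.
// Returns false for blank lines, comment lines, and lines that do not
// have three terms followed by a terminating '.'.
bool parseNTriple(const std::string& line, Statement& out) {
    const std::string ws = " \t\r\n";
    std::size_t sBegin = line.find_first_not_of(ws);
    if (sBegin == std::string::npos || line[sBegin] == '#') return false;

    // The last non-whitespace character must be the terminating '.'.
    std::size_t dot = line.find_last_not_of(ws);
    if (line[dot] != '.') return false;

    // Subject and predicate contain no whitespace in N-Triples.
    std::size_t sEnd = line.find_first_of(ws, sBegin);
    if (sEnd == std::string::npos) return false;
    std::size_t pBegin = line.find_first_not_of(ws, sEnd);
    if (pBegin == std::string::npos) return false;
    std::size_t pEnd = line.find_first_of(ws, pBegin);
    if (pEnd == std::string::npos) return false;
    std::size_t oBegin = line.find_first_not_of(ws, pEnd);
    if (oBegin == std::string::npos || dot <= oBegin) return false;

    // The object may be a quoted literal containing spaces, so it is
    // everything from oBegin up to the last non-space before the '.'.
    std::size_t oEnd = line.find_last_not_of(ws, dot - 1);
    assert(oEnd != std::string::npos && oEnd >= oBegin);

    out.subject = line.substr(sBegin, sEnd - sBegin);
    out.predicate = line.substr(pBegin, pEnd - pBegin);
    out.object = line.substr(oBegin, oEnd - oBegin + 1);
    return true;
}
```

An N-Quads line carries an additional graph label after the object, so a quadruplet parser would need one more field and cannot simply take the rest of the line as the object.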

We first need to group the triplets or quadruplets by subject (e.g., by sorting) to obtain the adjacency list of each subject. Compile triple_sort.cpp [N-Triples, N-Quads] using the Quegel system code for the TeraSort program, and run it to sort the triplets or quadruplets. Then merge each group into an adjacency list with a single-threaded C++ program [N-Triples, N-Quads] (compiled with the Quegel system code) that reads the sorted triplets/quadruplets from HDFS.
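The merge step can be pictured as a single pass over subject-sorted statements. The sketch below is an in-memory illustration only: it does not read from HDFS, and the types Triple, Edge, and AdjList and the function mergeSorted are hypothetical names, not those used in the linked programs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical in-memory types for illustration.
struct Triple { std::string subject, predicate, object; };
struct Edge { std::string predicate, object; };
struct AdjList {
    std::string subject;
    std::vector<Edge> out;  // one entry per statement with this subject
};

// Merge subject-sorted triples into one adjacency list per subject.
// Because the input is sorted, all triples of a subject are consecutive,
// so comparing with the most recent list is enough.
std::vector<AdjList> mergeSorted(const std::vector<Triple>& sorted) {
    std::vector<AdjList> result;
    for (const Triple& t : sorted) {
        // Precondition: subjects arrive in non-decreasing order.
        assert(result.empty() || result.back().subject <= t.subject);
        if (result.empty() || result.back().subject != t.subject) {
            result.push_back(AdjList{t.subject, {}});
        }
        result.back().out.push_back(Edge{t.predicate, t.object});
    }
    return result;
}
```

Since only the current group is ever appended to, a streaming version could write each adjacency list out as soon as the subject changes instead of keeping all lists in memory.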

Now each vertex is associated with an adjacency list, but its ID is a string (i.e., the subject text). We need to assign an integer ID to each vertex and replace the string IDs in each adjacency list with the corresponding integer IDs. Compile stringID2int.cpp [N-Triples, N-Quads] with the Quegel system code to perform this step.
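As a single-machine illustration of the ID replacement (not the distributed logic of stringID2int.cpp), the sketch below hands out dense integer IDs in order of first appearance and rewrites each adjacency list accordingly. The names IdAssigner, StringAdj, and relabel are hypothetical, and it is an assumption of this sketch that a neighbor which never appears as a subject still receives an ID and an empty list.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assigns dense integer IDs 0, 1, 2, ... in order of first appearance.
class IdAssigner {
public:
    int idOf(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        int id = static_cast<int>(names_.size());
        ids_.emplace(name, id);
        names_.push_back(name);
        return id;
    }
    const std::string& nameOf(int id) const {
        assert(id >= 0 && static_cast<std::size_t>(id) < names_.size());
        return names_[id];
    }
    std::size_t size() const { return names_.size(); }

private:
    std::unordered_map<std::string, int> ids_;
    std::vector<std::string> names_;
};

// (subject, neighbor strings) pairs, one per vertex.
using StringAdj = std::vector<std::pair<std::string, std::vector<std::string>>>;

// Replace string IDs with integer IDs; the result is indexed by integer ID.
std::vector<std::vector<int>> relabel(const StringAdj& adj, IdAssigner& ids) {
    std::vector<std::vector<int>> out;
    for (const auto& entry : adj) {
        int u = ids.idOf(entry.first);
        if (out.size() <= static_cast<std::size_t>(u)) out.resize(u + 1);
        for (const std::string& nb : entry.second) {
            out[u].push_back(ids.idOf(nb));
        }
    }
    out.resize(ids.size());  // neighbors seen only as objects get empty lists
    return out;
}
```

Keeping the reverse table (nameOf) makes it possible to map integer IDs in query results back to the original subject text.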

So far, each vertex is associated only with its out-neighbors. We also need its in-neighbors, since the graph keyword search application sends messages to in-neighbors. Compile out2in.cpp [N-Triples, N-Quads] with the Quegel system code to perform this step.
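Conceptually, deriving in-neighbors means reversing every edge: for each edge u -> v, u becomes an in-neighbor of v. The following single-machine sketch shows that idea on integer-ID adjacency lists; buildInNeighbors is a hypothetical name, and the code does not reflect how out2in.cpp distributes the work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Given out-neighbor lists indexed by integer vertex ID, build the
// in-neighbor lists: every edge u -> v adds u to the list of v.
std::vector<std::vector<int>> buildInNeighbors(
        const std::vector<std::vector<int>>& out) {
    std::vector<std::vector<int>> in(out.size());
    for (std::size_t u = 0; u < out.size(); ++u) {
        for (int v : out[u]) {
            // Every neighbor ID must refer to an existing vertex.
            assert(v >= 0 && static_cast<std::size_t>(v) < out.size());
            in[v].push_back(static_cast<int>(u));
        }
    }
    return in;
}
```

Because the outer loop visits sources in increasing ID order, each in-neighbor list comes out sorted by source ID.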