August 2016

Task force for the design of a query language for graph-structured data

Submitted by Pablo Barcelo on Tue, 08/30/2016 - 17:03

For more than a year already, I have been collaborating with people from industry and academia on the standardization of the data model and query language for graph-structured data. This group includes participants from world-leading graph database engines, in particular, Neo Technologies, SAP, Oracle, Sparsity Technologies, and IBM. It also includes graph database researchers from different academic institutions around the world, with interests ranging from the purely theoretical to the most applied. As the reader might imagine, it is not always easy understanding each other in such a diverse environment, in the same way that agreement is not always easily reached. Nevertheless, it is fair to say that discussions have, in general, been enriching and fruitful, and that the decisions taken by the group are often reasonable and well thought out. I am personally convinced that this diversity of views is a necessary condition for a standardization process of this kind to be successful: While developers bring the view from real-world applications and provide use cases for the features of the language, theoreticians define formal syntax and semantics, and establish the limits for unreasonably costly expressiveness needs.

So far, the group has managed to accomplish several tasks. First, we identified a graph data model that can be applied in a wide-range of practical applications. This corresponds to the so-called property graphs, which are essentially finite, directed, node- and edge-labeled graphs, with attributes in both nodes and edges. Then the group embarked in a vast revision of the literature and use cases as a way to understand what  features are needed in a graph query language. The basic such features identified correspond to: pattern matching, path-based navigation, aggregation, and transformations among different data types (e.g., tables, paths, and graphs themselves). Based on such features, the group is currently working on the design of a general-purpose query language which should be finished during the first semester of 2017.

The most rewarding conclusion I can take from this task force is the fact that an important part of the theoretical work done in graph databases over the last three decades is of relevance to practice. For instance, while well-studied regular path queries (RPQs) are not currently integrated in graph query languages (e.g., in Cypher), the group has agreed on their importance and will for sure become part of the standard. In the same way, recent work on graph query languages with data comparisons has helped establishing the limits of what this standard should express in terms of them. Finally, the work done by Mendelzon and Wood in the late 80s, regarding the high cost of interpreting RPQs under a simple path semantics, has provided us with good arguments for discarding such interpretation (that is often the one that developers prefer), inciting us to look for more efficient, yet practically relevant, versions of it.

Submitted by Jan Van den Bussche on Thu, 09/01/2016 - 14:16

Permalink

I would be interested in hearing how the graph query language standardization efforts relate or compare to the JSON query language standardization efforts, which these days goes under the header of "SQL++".

Choiceless polynomial time, rank logic, and other polynomial-time query languages

Submitted by Jan Van den Bussche on Mon, 08/08/2016 - 16:33

Hagenberg CastleJust came back from a visit to Flavio Ferrarotti in Hagenberg, Austria. Hagenberg is an inspirational mixture of a traditional countryside village and a high-tech software campus. Flavio works in finite model theory and during our discussions I was reminded of the beautiful formalism of Choiceless Polynomial Time with Counting invented by Gurevich and Shelah. Basically it is programming with sets, much like in the programming language SETL, but in a logical way, i.e., atomic data elements are abstract entities and not numbers that can be compared, added, multiplied, and so on.

The conjecture is that not every polynomial-time database query can be expressed in Choiceless Polynomial Time with Counting; for nonboolean queries, this has been proved by Benjamin Rossman but it remains open for boolean queries.

Yuri Gurevich actually conjectures that there is no language for the polynomial time boolean queries, but proving this is of course quite a challenge, in view of the fact that there does exist a language for the NP boolean queries (existential second-order logic, compare Fagin's Theorem).

More recently, Anuj Dawar and collaborators investigated an interesting alternative approach, based on adding primitives from linear algebra (notably, computing the rank of a matrix) to fixpoint logic. I have not followed this closely but would be interested in hearing the latest feeling/opinion on this development. It would be great if Anuj could chime in with a brief comment bringing us up to date on this matter!

Submitted by Anuj Dawar on Mon, 08/22/2016 - 19:41

Permalink

Jan, thank you for inviting me to comment on this.  It is great to see this blog up and running.

The relationship between choiceless polynomial time with counting (CPTC) on the one hand and fixed-point logic with rank (FPR) on the other is certainly interesting.  Neither one has been shown included in the other, and it would be really interesting to show separations.  One point I should stress is that the primitives from linear algebra that are added to fixed-point logic are operators for the rank of a matrix *over finite fields*.  The reason is that this is what fixed-point logic with counting (FPC) is unable to express.  Linear algebra over the rationals turns out to be easier to define and most of what we want to do is already definable in FPC.

For most recent work on CPTC and FPR, I strongly recommend Wied Pakusa's PhD thesis available from his webpage:

https://logic.rwth-aachen.de/People/Pakusa/