query languages | PRINCIPLES of DATA MANAGEMENT -- databasetheory.org

FO3 and the algebra of binary relations

Illustration of binary relation — *Image credit: Why U*

The algebra of binary relations, denoted here simply by RA, is one of the oldest logics, created by De Morgan, Pierce and Schroder in the 19th century. Tarski and his collaborators revived the logic in the 20th century. Nowadays RA is still very relevant as it lies at the basis of dynamic logics, description logics, and XPath-like query languages for tree and graph data.

A very interesting result by Tarski and Givant is the equivalence between RA and $\mathrm{FO}^3$, the three-variable fragment of first-order logic. The monograph by Tarski and Givant on the formalization of set theory without variables contains a proof, but that monograph is not for the impatient. The book by Marx and Venema on multi-dimensional modal logic contains an elegant and short proof. As that book is again not something you can readily dive in and start reading somewhere in the middle, we give here a self-contained version of the Marx-Venema proof.

Let us begin by recalling the definition of RA. We start from a binary relational vocabulary $\Upsilon$. Recall that an $\Upsilon$-structure $I$ consists of a nonempty domain of atomic data elements, denoted by $\mathrm{dom}(I)$, and, for every relation name $R$ from $\Upsilon$, a binary relation $I(R)$ on $\mathrm{dom}(I)$. We can think of $\Upsilon$ as a database schema with only binary relation names, and of $I$ as an instance of that schema. The only difference is that usually database instances are defined without an explicit domain, instead using the active-domain semantics for first-order logic. Below we continue to work with structures with explicit domains, but everything can be adapted to work with active-domain semantics instead.

The expressions $e$ of RA are generated by the following grammar:

$$ e ::= R \mid e \cup e \mid \neg e \mid e \circ e \mid e^{-1} \mid {\it id} \mid {\it all} $$ Here, as above, the symbol $R$ ranges over the relation names from $\Upsilon$. Given any $\Upsilon$-structure $I$, an RA-expression $e$ defines a binary relation $e(I)$ on $\mathrm{dom}(I)$ in a natural manner. Obviously, $\cup$ is union; $\neg$ is complement with respect to $\mathrm{dom}(I)^2$. The operator $\circ$ is the composition of two binary relations, i.e., $r \circ s = \{(a,c) \mid \exists b : (a,b) \in r$ and $(b,c) \in s\}$. The operator ${}^{-1}$, called converse, inverts a binary relation, i.e., $r^{-1} = \{(b,a) \mid (a,b) \in r\}$. Finally, the primitives ${\it id}$ and ${\it all}$ evaluate to the identity relation and the complete binary relation on $\mathrm{dom}(I)$, respectively.

It is not difficult to see that RA is subsumed by $\mathrm{FO}^3$, in that for every RA-expression $e$ there exists a first-order logic formula $\varphi$ with two free variables and three variables overall, so that on every structure, $\varphi$ defines the same relation as $e$. For example, the expression $R \circ (S \circ T)$ can be equivalently expressed as $$ \{(x,y) \mid \exists z(R(x,z) \land \exists x(S(z,x) \land T(x,y)))\}. $$ Note the reuse of the variable $x$. The interesting result is that the converse holds as well: for every $\varphi$ as above there exists an $e$ defining the same relation on every structure.

To prove this result, let us fix the three available variables as $x$, $y$ and $z$. We begin by defining, for every two-element subset $\{u,v\} \subseteq \{x,y,z\}$, the logic $L_{\{u,v\}}$. This logic contains all atomic formulas over $\Upsilon$ involving the variables $u$ and $v$, as well as an explicit proposition $\top$ for true; is closed under the boolean connectives; and also contains all quantified formulas of the form $\exists w(\gamma \land \chi)$, where $w$ is the third variable, i.e., the variable not in $\{u,v\}$, and $\gamma$ and $\chi$ are formulas in $L_{\{u,w\}}$ and $L_{\{w,v\}}$, respectively. Thus, the three logics $L_{\{x,y\}}$, $L_{\{x,z\}}$ and $L_{\{y,z\}}$ are defined by mutual recursion. Note that the only variables that can occur free in a formula in $L_{\{u,v\}}$ are $u$ and $v$.

It is not difficult to see that, for every formula $\varphi$ in $L_{\{u,v\}}$, we can express $\{(u,v) \mid \varphi\}$ in RA. For example, we can express $$ \{(x,y) \mid \exists z(R(z,x) \land \top) \land y=x\} $$ by $(R^{-1}\circ {\it all}) \cap {\it id}$, where we use intersection $\cap$ as a shorthand which can be expressed in terms of union and complement.

We now claim that every $\mathrm{FO}^3$ formula over $\Upsilon$ is equivalent to a boolean combination of formulas from $L_{\{x,y\}}$, $L_{\{x,z\}}$ and $L_{\{y,z\}}$. This claim can be proven by structural induction. The basis of the induction is formed by the atomic formulas. Since $\Upsilon$ has only binary relation names, every atomic formula is always in $L_{\{x,y\}}$, $L_{\{x,z\}}$ or $L_{\{y,z\}}$ as desired. For the induction step, the boolean connectives pose no problem. So, the only case that is not immediately clear is when $\varphi$ is of the form $\exists w\, \psi$. Let $u$ and $v$ be the two variables other than $w$. By induction, we can write $\psi$ as a boolean combination of formulas from $L_{\{u,v\}}$, $L_{\{u,w\}}$ and $L_{\{w,v\}}$. We can put this boolean combination in disjunctive normal form and distribute the existential quantifier over the disjuncts, so that we are left with formulas of the form $\exists w(\beta \land \gamma \land \chi)$, where $\beta$, $\gamma$ and $\chi$ are from $L_{\{u,v\}}$, $L_{\{u,w\}}$ and $L_{\{w,v\}}$, respectively. We can rewrite this formula as $\beta \land \exists w(\gamma \land \chi)$ which is a formula in $L_{\{u,v\}}$ as desired.

The result now follows. Let $\varphi$ be an $\mathrm{FO}^3$ formula with two free variables $u$ and $v$. Let $w$ be the third variable. By the above claim, we can write $\varphi$ as a boolean combination of formulas from $L_{\{u,v\}}$, $L_{\{u,w\}}$ and $L_{\{w,v\}}$. Of these constituent formulas, those that are in $L_{\{u,w\}}$ can have only $u$ free, so they can be rewritten as $L_{\{u,v\}}$-formulas by swapping $w$ and $v$. Similarly, the constituent formulas that are in $L_{\{v,w\}}$ can be rewritten as $L_{\{u,v\}}$-formulas by swapping $w$ and $u$. The final result is an $L_{\{u,v\}}$-formula, which concludes the proof, as we have seen earlier that $\{(u,v)\mid \varphi\}$, with $\varphi$ an $L_{\{u,v\}}$-formula, is expressible in RA.

Thanks to Floris Geerts for suggesting that I write this blog post.

Task force for the design of a query language for graph-structured data

For more than a year already, I have been collaborating with people from industry and academia on the standardization of the data model and query language for graph-structured data. This group includes participants from world-leading graph database engines, in particular, Neo Technologies, SAP, Oracle, Sparsity Technologies, and IBM. It also includes graph database researchers from different academic institutions around the world, with interests ranging from the purely theoretical to the most applied. As the reader might imagine, it is not always easy understanding each other in such a diverse environment, in the same way that agreement is not always easily reached. Nevertheless, it is fair to say that discussions have, in general, been enriching and fruitful, and that the decisions taken by the group are often reasonable and well thought out. I am personally convinced that this diversity of views is a necessary condition for a standardization process of this kind to be successful: While developers bring the view from real-world applications and provide use cases for the features of the language, theoreticians define formal syntax and semantics, and establish the limits for unreasonably costly expressiveness needs.

So far, the group has managed to accomplish several tasks. First, we identified a graph data model that can be applied in a wide-range of practical applications. This corresponds to the so-called property graphs, which are essentially finite, directed, node- and edge-labeled graphs, with attributes in both nodes and edges. Then the group embarked in a vast revision of the literature and use cases as a way to understand what features are needed in a graph query language. The basic such features identified correspond to: pattern matching, path-based navigation, aggregation, and transformations among different data types (e.g., tables, paths, and graphs themselves). Based on such features, the group is currently working on the design of a general-purpose query language which should be finished during the first semester of 2017.

The most rewarding conclusion I can take from this task force is the fact that an important part of the theoretical work done in graph databases over the last three decades is of relevance to practice. For instance, while well-studied regular path queries (RPQs) are not currently integrated in graph query languages (e.g., in Cypher), the group has agreed on their importance and will for sure become part of the standard. In the same way, recent work on graph query languages with data comparisons has helped establishing the limits of what this standard should express in terms of them. Finally, the work done by Mendelzon and Wood in the late 80s, regarding the high cost of interpreting RPQs under a simple path semantics, has provided us with good arguments for discarding such interpretation (that is often the one that developers prefer), inciting us to look for more efficient, yet practically relevant, versions of it.

relationship to JSON efforts?

I would be interested in hearing how the graph query language standardization efforts relate or compare to the JSON query language standardization efforts, which these days goes under the header of "SQL++".

Read more about Task force for the design of a query language for graph-structured data
1 comment
Log in or register to post comments