SIGMOD Record now accepts nominations for papers on data management that you think deserve to be placed in the spotlight. We should nominate theory papers from ICDT as well as from PODS! I just submitted the following nomination:
"Distributed Streaming with Finite Memory" by Frank Neven, Nicole Schweikardt, Frédéric Servais, and Tony Tan
Presented at ICDT 2015.
This paper proposes a tractable formal computational model for MapReduce computations. The focus is on computations where the worker nodes use bounded memory. This models the common situation where the data local to each node is processed in a streaming fashion. Formal computational models are useful because they allow to rigorously prove impossibility results. For example, it is well known how to compute a join in MapReduce. The algorithm distributes tuples on their join attribute in the Map phase; then all the Reduce phase has to do is to essentially output Cartesian products. Obviously, this Cartesian product phase can become a quadratic bottleneck if the data is skewed. The model in the paper allows us to prove formally that this quadratic bottleneck is unavoidable.
The paper contains many more insights. What if we just want to compute a semijoin? Then the output is linear. Suppose we semijoin R to S on a common attribute A. The reducer assigned to a particular A-value x can detect in a streaming fashion that the semijoin is nonempty; it simply checks that it has received both R- and S-tuples. However, once it has detected that, it is too late to produce the output consisting of all the R-tuples; it could not remember all of them because the memory is bounded. If the reducer gets a second pass over the data, it can now stream again over the R-tuples and output them. The necessity of a second pass is again something that can be rigorously proved.
Other results of the paper include a formal proof that allowing more rounds allows more and more queries to be computed. This paper is theoretical computer science at its best; it models a practical paradigm of computing and attempts to explain and gain understanding about what is possible and what is not when using this paradigm.