# Joins as Matrices

Warning: if you're unfamiliar with SQL Joins, go have a look at the Venn Diagram explanation here.

We'll start with the last example, Cartesian Joins. Recall the definition of a Cartesian Product:

$X\times Y = \{\,(x,y)\mid x\in X \ \text{and} \ y\in Y\,\}.$

In other words, it's a set of every element from the first set paired with every element of the second set. It's fairly easy to transform this into a matrix. Supposing you have two column vectors A and B, which are the sets X and Y, do a standard matrix multiplication of A and B transposed. For a 3x3 matrix:

$\begin{bmatrix}a \\ b \\ c\end{bmatrix} * {\begin{bmatrix}d \\ e \\ f\end{bmatrix}}^T = \begin{bmatrix}ad & ae & af \\ bd & be & bf \\ cd & ce & cf\end{bmatrix}$

So using the table data from Atwood's post, a cross join / Cartesian product looks like this:

select * from tableA cross join tableB;
+----+-----------+----+-------------+
| id | name      | id | name        |
+----+-----------+----+-------------+
|  1 | Pirate    |  1 | Rutabaga    |
|  2 | Monkey    |  1 | Rutabaga    |
|  3 | Ninja     |  1 | Rutabaga    |
|  4 | Spaghetti |  1 | Rutabaga    |
|  1 | Pirate    |  2 | Pirate      |
|  2 | Monkey    |  2 | Pirate      |
|  3 | Ninja     |  2 | Pirate      |
|  4 | Spaghetti |  2 | Pirate      |
|  1 | Pirate    |  3 | Darth Vader |
|  2 | Monkey    |  3 | Darth Vader |
|  3 | Ninja     |  3 | Darth Vader |
|  4 | Spaghetti |  3 | Darth Vader |
|  1 | Pirate    |  4 | Ninja       |
|  2 | Monkey    |  4 | Ninja       |
|  3 | Ninja     |  4 | Ninja       |
|  4 | Spaghetti |  4 | Ninja       |
+----+-----------+----+-------------+


Note: the cross join here can be rewritten as select * from tableA, tableB;

which can be turned into a Matrix with each cell containing the tuple (idA, nameA, idB, nameB):

$M = \begin{bmatrix}(1, Pirate, 1, Rutabaga) & (1, Pirate, 2, Pirate) & (1, Pirate, 3, Darth Vader) & (1, Pirate, 4, Ninja) \\ (2, Monkey, 1, Rutabaga) & (2, Monkey, 2, Pirate) & (2, Monkey, 3, Darth Vader) & (2, Monkey, 4, Ninja) \\ (3, Ninja, 1, Rutabaga) & (3, Ninja, 2, Pirate) & (3, Ninja, 3, Darth Vader) & (3, Ninja, 4, Ninja) \\ (4, Spaghetti, 1, Rutabaga) & (4, Spaghetti, 2, Pirate) & (4, Spaghetti, 3, Darth Vader) & (4, Spaghetti, 4, Ninja)\end{bmatrix}$

Phew!

The advantages of constructing this matrix is that it makes it very easy to see what "cells" (aka records) are selected based on the join condition. The entire set of records is taken on no joins, if we stipulate a join condition that "nameA = nameB" then only two cells are returned: $M(0,1)$ and $M(2,3)$ (the matching Pirates and Ninjas) (where $M(i,j)$ returns the cell in the ith row and jth column).

It's also easy to see how to construct the result set from "weird" join conditions; that is, join conditions that aren't strict equality checks on keys that can take advantage of DB indexing magic. As an example, consider joining when the distance between idA and idB is more than or equal to 2. That is, select * from tableA, tableB where abs(tableA.id - tableB.id) >= 2. (Alternatively cross join on abs....) The grid should light up like this, where black cells are matching cells:

(Kind of pretty, no?)

The main disadvantage to this matrix approach is when we have big tables with lots of data. Computers have finite memory; good luck dealing with millions by millions size matrices! There is a very clever solution to this problem, however, which can make use of the MapReduce framework. It's called the 1-bucket theta algorithm! (And its cousins, M-bucket output/input.) I'll write about that when I understand it better and have a simple demo program or at least some cool numbers.

#### Posted on 2011-10-25 by Jach

Tags: databases, programming

LaTeX allowed in comments, use $\\...\\$\$ to wrap inline and $$...$$ to wrap blocks.