Several join methods are available for partitioned database
environments, including: collocated joins, broadcast outer-table joins,
directed outer-table joins, directed inner-table and outer-table joins,
broadcast inner-table joins, and directed inner-table joins.
In the following diagrams, q1, q2, and q3 refer to table queues.
The referenced tables are divided across two database partitions,
and the arrows indicate the direction in which the table queues are
sent. The coordinator database partition is database partition 0.
If the join method chosen by the compiler
is hash join, the filters created at each remote database partition
may be used to eliminate tuples before they are sent to the database
partition where the hash join is processed, thus improving performance.
Collocated joins
A collocated join occurs
locally on the database partition on which the data resides. The database
partition sends the data to the other database partitions after the
join is complete. For the optimizer to consider a collocated join,
the joined tables must be collocated, and all pairs of the corresponding
distribution keys must participate in the equality join predicates. Figure 1 provides an example.
The LINEITEM and ORDERS tables are both partitioned on the ORDERKEY
column. The join is performed locally at each database partition.
In this example, the join predicate is assumed to be: orders.orderkey
= lineitem.orderkey.
Replicated materialized query
tables (MQTs) enhance the likelihood of collocated joins.
Broadcast outer-table joins
Broadcast outer-table
joins represent a parallel join strategy that can be used if there
are no equality join predicates between the joined tables. It can
also be used in other situations in which it proves to be the most
cost-effective join method. For example, a broadcast outer-table join
might occur when there is one very large table and one very small
table, neither of which is split on the join predicate columns. Instead
of splitting both tables, it might be cheaper to broadcast the smaller
table to the larger table. Figure 2 provides
an example.
The ORDERS table is sent to all database partitions that have
the LINEITEM table. Table queue q2 is broadcast to all database partitions
of the inner table.
Directed outer-table joins
In the directed
outer-table join strategy, each row of the outer table is sent to
one portion of the inner table, based on the splitting attributes
of the inner table. The join occurs on this database partition. Figure 3 provides an example.
Directed inner-table and outer-table joins
In
the directed inner-table and outer-table join strategy, rows of both
the outer and inner tables are directed to a set of database partitions,
based on the values of the joining columns. The join occurs on these
database partitions. Figure 4 provides
an example.
Neither table is partitioned on the ORDERKEY column. Both tables
are hashed and sent to new database partitions, where they are joined.
Both table queue q2 and q3 are directed. In this example, the join
predicate is assumed to be: orders.orderkey = lineitem.orderkey.
Broadcast inner-table joins
In the broadcast
inner-table join strategy, the inner table is broadcast to all the
database partitions of the outer table. Figure 5 provides an example.
Directed inner-table joins
In the directed
inner-table join strategy, each row of the inner table is sent to
one database partition of the outer table, based on the splitting
attributes of the outer table. The join occurs on this database partition. Figure 6 provides an example.