The range prtitioner

A range partitioner divides a data set into approximately equal size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set.

This topic describes the range partitioner, the partitioner that implements range partitioning. It also describes the writerangemap operator, which you use to construct the range map file required for range partitioning, and the stand-alone makerangemap utility.

The range partitioner guarantees that all records with the same partitioning key values are assigned to the same partition and that the partitions are approximately equal in size so all nodes perform an equal amount of work when processing the data set.

The diagram shows an example of the results of a range partition. The partitioning is based on the age key, and the age range for each partition is indicated by the numbers in each bar. The height of the bar shows the size of the partition.

Figure 1. Range Partitioning Example
Shows the results of an example range partitioning based on an age key. The data is grouped in age ranges in each partition

All partitions are of approximately the same size. In an ideal distribution, every partition would be exactly the same size. However, you typically observe small differences in partition size.

In order to size the partitions, the range partitioner orders the partitioning keys. The range partitioner then calculates partition boundaries based on the partitioning keys in order to evenly distribute records to the partitions. As shown above, the distribution of partitioning keys is often not even; that is, some partitions contain many partitioning keys, and others contain relatively few. However, based on the calculated partition boundaries, the number of records in each partition is approximately the same.

Range partitioning is not the only partitioning method that guarantees equivalent-sized partitions. The random and roundrobin partitioning methods also guarantee that the partitions of a data set are equivalent in size. However, these partitioning methods are keyless; that is, they do not allow you to control how records of a data set are grouped together within a partition.