Hash partitioner

Partitioning is based on a function of one or more columns (the hash partitioning keys) in each record. The hash partitioner examines one or more fields of each input record (the hash key fields). Records with the same values for all hash key fields are assigned to the same processing node.

This method is useful for ensuring that related records are in the same partition, which might be a prerequisite for a processing operation. For example, for a remove duplicates operation, you can hash partition records so that records with the same partitioning key values are on the same node. You can then sort the records on each node using the hash key fields as sorting key fields, then remove duplicates, again using the same keys. Although the data is distributed across partitions, the hash partitioner ensures that records with identical keys are in the same partition, allowing duplicates to be found.

Hash partitioning does not necessarily result in an even distribution of data between partitions. For example, if you hash partition a data set based on a zip code field, where a large percentage of your records are from one or two zip codes, you can end up with a few partitions containing most of your records. This behavior can lead to bottlenecks because some nodes are required to process more records than other nodes.

For example, the diagram shows the possible results of hash partitioning a data set using the field age as the partitioning key. Each record with a given age is assigned to the same partition, so for example records with age 36, 40, or 22 are assigned to partition 0. The height of each bar represents the number of records in the partition.

Shows a hash partitioner partitioning a data set using the age column as the key

As you can see, the key values are randomly distributed among the different partitions. The partition sizes resulting from a hash partitioner are dependent on the distribution of records in the data set so even though there are three keys per partition, the number of records per partition varies widely, because the distribution of ages in the population is non-uniform.

When hash partitioning, you should select hashing keys that create a large number of partitions. For example, hashing by the first two digits of a zip code produces a maximum of 100 partitions. This is not a large number for a parallel processing system. Instead, you could hash by five digits of the zip code to create up to 10,000 partitions. You also could combine a zip code hash with an age hash (assuming a maximum age of 190), to yield 1,500,000 possible partitions.

Fields that can only assume two values, such as yes/no, true/false, male/female, are particularly poor choices as hash keys.

You must define a single primary partitioning key for the hash partitioner, and you might define as many secondary keys as are required by your job. Note, however, that each column can be used only once as a key. Therefore, the total number of primary and secondary keys must be less than or equal to the total number of columns in the row.

You specify which columns are to act as hash keys on the Partitioning tab of the stage editor The data type of a partitioning key might be any data type except raw, subrecord, tagged aggregate, or vector. By default, the hash partitioner does case-sensitive comparison. This means that uppercase strings appear before lowercase strings in a partitioned data set. You can override this default if you want to perform case insensitive partitioning on string fields.