IBM InfoSphere DataStage and InfoSphere QualityStage, Version 8.5

Modulus partitioner

Partitioning is based on a key column modulo the number of partitions. This method is similar to hash by field, but involves simpler computation.

In data mining, data is often arranged in buckets, that is, each record has a tag containing its bucket number. You can use the modulus partitioner to partition the records according to this number. The modulus partitioner assigns each record of an input data set to a partition of its output data set as determined by a specified key field in the input data set. This field can be the tag field.

The partition number of each record is calculated as follows:
partition_number = fieldname mod number_of_partitions 
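where fieldname is the key field of the input data set and number_of_partitions is the number of partitions of the output data set.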
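
As a minimal sketch of this calculation, assuming non-negative integer key values, the formula can be written in Python as follows. The function name modulus_partition and its parameter names are hypothetical and are not part of the product; the sketch illustrates only the arithmetic, not how the partitioner is implemented.

```python
def modulus_partition(key_value: int, number_of_partitions: int) -> int:
    """Return the partition number for one record, given its key value."""
    if number_of_partitions <= 0:
        raise ValueError("number_of_partitions must be positive")
    if key_value < 0:
        # Assumption: this sketch handles only non-negative key values.
        raise ValueError("key_value must be non-negative in this sketch")
    # partition_number = fieldname mod number_of_partitions
    return key_value % number_of_partitions


print(modulus_partition(64123, 4))  # prints 3
```
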
In this example, the modulus partitioner partitions a data set containing ten records. Four processing nodes run the partitioner, and it divides the data among four partitions. The input data set has the following columns:
Table 1. Input data
Column name SQL type
bucket Integer
date Date

The bucket column is specified as the key field on which the modulus operation is calculated.

Here is the input data set. Each line represents a row:
Table 2. Input data set
bucket date
64123 1960-03-30
61821 1960-06-27
44919 1961-06-18
22677 1960-09-24
90746 1961-09-15
21870 1960-01-01
87702 1960-12-22
4705 1961-12-13
47330 1961-03-21
88193 1962-03-12
The following table shows the output data set divided among four partitions by the modulus partitioner.
Table 3. Output data set
Partition 0   Partition 1        Partition 2        Partition 3
-             61821 1960-06-27   21870 1960-01-01   64123 1960-03-30
-             22677 1960-09-24   87702 1960-12-22   44919 1961-06-18
-             4705 1961-12-13    47330 1961-03-21   -
-             88193 1962-03-12   90746 1961-09-15   -
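Here are three sample modulus operations corresponding to three of the key field values:
22677 mod 4 = 1; the data is written to Partition 1.
47330 mod 4 = 2; the data is written to Partition 2.
64123 mod 4 = 3; the data is written to Partition 3.
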
None of the key field values is evenly divisible by 4, so no data is written to Partition 0.
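
The example can be reproduced with a short Python sketch that applies the modulus operation to the ten rows of Table 2. The names ROWS and partition_rows are hypothetical, chosen for illustration, and the sketch mimics only the arithmetic, not the parallel execution on processing nodes. Rows within each partition appear here in input order, which can differ from the row order shown in Table 3.

```python
# The ten input rows from Table 2, as (bucket, date) pairs.
ROWS = [
    (64123, "1960-03-30"),
    (61821, "1960-06-27"),
    (44919, "1961-06-18"),
    (22677, "1960-09-24"),
    (90746, "1961-09-15"),
    (21870, "1960-01-01"),
    (87702, "1960-12-22"),
    (4705, "1961-12-13"),
    (47330, "1961-03-21"),
    (88193, "1962-03-12"),
]


def partition_rows(rows, number_of_partitions):
    """Group rows by bucket mod number_of_partitions."""
    # Create every partition up front so that empty ones stay visible.
    partitions = {p: [] for p in range(number_of_partitions)}
    for bucket, date in rows:
        partitions[bucket % number_of_partitions].append((bucket, date))
    return partitions


result = partition_rows(ROWS, 4)
for number, rows in result.items():
    buckets = [bucket for bucket, _ in rows]
    print(f"Partition {number}: {len(rows)} rows, buckets {buckets}")
```

Running the sketch shows that Partition 0 receives no rows, because none of the bucket values is a multiple of 4.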


This topic is also in the IBM InfoSphere DataStage and QualityStage Parallel Job Developer's Guide.

Last updated: 2012-10-8