Sorting data

Review your job designs and, where possible, reorder the job flow so that operations that use the same sort keys are combined. Coordinate your sorting strategy with your hashing strategy.

It is sometimes possible to rearrange the order of business logic within a job flow to leverage the same sort order, partitioning, and groupings.

If data has already been partitioned and sorted on a set of key columns, specify the "don't sort, previously sorted" option for the key columns in the Sort stage. This reduces the cost of sorting and takes greater advantage of pipeline parallelism.
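The saving from reusing an existing sort order can be illustrated outside the product. The following is a minimal Python sketch, not DataStage code: it assumes rows already arrive ordered on a primary key and only need ordering on a secondary key. The function name sort_within_groups and the column names cust and date are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sort_within_groups(rows, sorted_key, new_key):
    """Order rows on (sorted_key, new_key), assuming the input is
    already ordered on sorted_key. Only one group is buffered at a time."""
    for _, group in groupby(rows, key=itemgetter(sorted_key)):
        # Every row in the group shares one sorted_key value,
        # so only new_key has to be sorted here.
        yield from sorted(group, key=itemgetter(new_key))


# Hypothetical input, already sorted on "cust" but not on "date".
rows = [
    {"cust": 1, "date": "2024-03-02"},
    {"cust": 1, "date": "2024-01-15"},
    {"cust": 2, "date": "2024-02-20"},
    {"cust": 2, "date": "2024-02-01"},
]

result = list(sort_within_groups(rows, "cust", "date"))
```

Because each group can be released as soon as the primary key value changes, downstream processing can start before the whole input has been read, which is the same reason a previously sorted key helps pipeline parallelism.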

When writing to parallel data sets, sort order and partitioning are preserved. When reading from these data sets, try to maintain this sorting, if possible, by using the Same partitioning method.

The stable sort option is much more expensive than a non-stable sort. Use it only when rows with equal key values must keep their relative input order, that is, when you need to preserve row order beyond what the sort keys themselves require.
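What a stable sort guarantees can be shown with a small Python sketch. This is only an illustration of the concept, not of the Sort stage itself; the region and order values are made up. Python's built-in sorted() happens to be stable, so it serves as a convenient example.

```python
# Hypothetical rows: (region, order_id), listed in arrival order.
rows = [
    ("east", "order-1"),
    ("west", "order-2"),
    ("east", "order-3"),
    ("west", "order-4"),
]

# sorted() is stable: rows with the same region keep their arrival order.
stable = sorted(rows, key=lambda r: r[0])

# A non-stable sort would be free to return order-3 before order-1;
# both outcomes are correctly sorted on region alone.
```

If nothing downstream depends on the arrival order of rows within a key value, the non-stable result is just as valid, and the extra cost of stability buys nothing.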

The performance of individual sorts can be improved by increasing the memory usage per partition using the Restrict Memory Usage (MB) option of the Sort stage. The default setting is 20 MB per partition. Note that sort memory usage can be specified only for standalone Sort stages; it cannot be changed for inline (on a link) sorts.
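Because the limit applies per partition, raising it multiplies across the job. The sketch below is a rough back-of-the-envelope estimate in Python, assuming every standalone Sort stage on every partition may use up to its limit at the same time. The function name and the example figures (8 partitions, 3 Sort stages, 256 MB) are hypothetical; only the 20 MB default comes from the text above.

```python
DEFAULT_SORT_MB = 20  # default per-partition limit stated above


def estimated_sort_memory_mb(partitions, sort_stages,
                             mb_per_partition=DEFAULT_SORT_MB):
    """Upper-bound estimate of sort buffer memory for one job,
    assuming all sorts on all partitions fill their buffers at once."""
    return partitions * sort_stages * mb_per_partition


# Hypothetical job: 8 partitions, 3 standalone Sort stages.
at_default = estimated_sort_memory_mb(8, 3)        # 8 * 3 * 20
at_256_mb = estimated_sort_memory_mb(8, 3, 256)    # 8 * 3 * 256
```

In this hypothetical job the estimate grows from 480 MB at the default to 6144 MB at 256 MB per partition, so it is worth checking available memory before raising the setting on several stages.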