The ability to process large volumes of data in a short period
of time depends on all aspects of the flow and the environment being optimized
for maximum throughput and performance. Performance tuning and optimization
are iterative processes that begin with job design and unit tests, proceed
through integration and volume testing, and continue throughout the production
life cycle of the application. Here are some performance pointers:
- When writing intermediate results that will be shared only between parallel jobs, always write to persistent data sets (using Data Set stages). Ensure that the data is partitioned, and that the partitioning and sort order are retained at every stage. Avoid format conversion and serial I/O.
- Data Set stages should be used to create restart points in the event that a job or sequence needs to be rerun. However, because data sets are platform- and configuration-specific, they should not be used for long-term backup and recovery of source data.
- Depending on available system resources, it might be possible to optimize overall processing time at run time by allowing smaller jobs to run concurrently (see the dsjob sketch after this list). However, care must be taken to plan for scenarios in which source files arrive later than expected or need to be reprocessed after a failure.
- Parallel configuration files allow the degree of parallelism and the resources used by parallel jobs to be set dynamically at run time. Multiple configuration files should be used to optimize overall throughput and to match job characteristics to the available hardware resources in development, test, and production environments (see the configuration file sketch after this list).
- The proper configuration of scratch and resource disks, and of the underlying file system and physical hardware architecture, can significantly affect overall job performance.
- Within clustered ETL and database environments, resource-pool naming can be used to limit processing to specific nodes, including database nodes when appropriate.