Discretize operator

With the Discretize operator, you can split the value range of a column into intervals and then map each interval to a discrete value.

The Discretize operator is used if there are too much different values in a column of a virtual table. It maps a range of values (interval) to a new single value. For instance, a table with bank customer data may contain a column with the average balance of the customer's account in a certain time period. Almost all customers will have different values in this column. You might want to consider only a few different ranges of average balances but not the exact values. Then you can use the Discretize operator to generate a new column that contains the information in which range or interval the customers balance falls.

You can discretize only numerical columns. However, the new column with the mapped values ("discretized values") can be of numerical or categorical type.

There are three methods to define the intervals:

The intervals can be defined manually: You specify each interval separately: the lower limit, the upper limit, and the new discretized value for the interval.

The intervals can be calculated from some basic values to be specified: You specify the number of intervals to be generated: the lower limit of the value range, the upper limit of the value range, and a prefix used to generate the new discretized values. Then, the system calculates intervals of equal width. Additionally, it generates one interval for all values below the lower limit and one interval for all values above the upper limit. For each interval, the new discretized value is generated by appending a consecutive number to the prefix.

The intervals can be automatically calculated based on statistical analysis: You only specify the approximate number of intervals to be generated and the prefix for the new discretized values: the exact number of intervals, the lower and upper value range limit, and the interval width are calculated based on the distribution statistics of the data. This method generates equally wide intervals with smooth boundaries. It is especially useful if the range of field values is unknown or if you want the interval generation based on data that is located in a different column or table.

Even if you have selected one of the methods that calculate intervals, you can still modify the resulting intervals manually later.

Some mining algorithms discretize input data internally, if necessary. The distribution-based clustering algorithm treats numerical columns almost like categorical columns by categorizing their values into buckets. The Associations mining function and the Sequential Patterns mining function also provide a discretization mechanism for numeric columns: If a numeric column contains more than 20 different values, the value range is automatically divided into buckets.

Feedback