Derive node
One of the most powerful features in Cloud Pak for Data is the ability to modify data values and derive new fields from existing data. During lengthy data mining projects, it is common to perform several derivations, such as extracting a customer ID from a string of web log data or creating a customer lifetime value based on transaction and demographic data. All of these transformations can be performed using a variety of field operations nodes.
Several nodes provide the ability to derive new fields:
- The Derive node modifies data values or creates new fields from one or more existing fields. It creates fields of type formula, flag, nominal, state, count, and conditional.
- The Reclassify node transforms one set of categorical values to another. Reclassification is useful for collapsing categories or regrouping data for analysis.
- The Binning node automatically creates new nominal (set) fields based on the values of one or more existing continuous (numeric range) fields. For example, you can transform a continuous income field into a new categorical field containing groups of income as deviations from the mean. After you create bins for the new field, you can generate a Derive node based on the cut points.
- The Set to Flag node derives multiple flag fields based on the categorical values defined for one or more nominal fields.
- The Restructure node converts a nominal or flag field into a group of fields that can be populated with the values of yet another field. For example, given a field named payment type, with values of credit, cash, and debit, three new fields would be created (credit, cash, debit), each of which might contain the value of the actual payment made.
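To make the difference between the Set to Flag and Restructure nodes concrete, the following is a minimal Python sketch of the idea, not the product's implementation. It assumes records are plain dictionaries, that the value field is a hypothetical field named amount, that new fields are named by joining the source field name and the category with an underscore, and that non-matching rows receive None. All of these choices are illustrative assumptions.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def set_to_flag(rows: List[Record], category_field: str,
                categories: List[str]) -> List[Record]:
    """Add one boolean field per category (illustrative naming scheme)."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input records are not mutated
        for cat in categories:
            new_row[f"{category_field}_{cat}"] = row[category_field] == cat
        out.append(new_row)
    return out


def restructure(rows: List[Record], category_field: str, value_field: str,
                categories: List[str]) -> List[Record]:
    """Add one field per category, filled from value_field when the row matches.

    Non-matching rows get None here; that is an assumption of this sketch.
    """
    out = []
    for row in rows:
        new_row = dict(row)
        for cat in categories:
            matches = row[category_field] == cat
            new_row[f"{category_field}_{cat}"] = row[value_field] if matches else None
        out.append(new_row)
    return out


# Hypothetical sample data: "amount" is not a field named in the text above.
payments = [
    {"payment type": "credit", "amount": 40.0},
    {"payment type": "cash", "amount": 12.5},
    {"payment type": "debit", "amount": 7.0},
]
kinds = ["credit", "cash", "debit"]

flagged = set_to_flag(payments, "payment type", kinds)
restructured = restructure(payments, "payment type", "amount", kinds)
```

In this sketch, set_to_flag records only whether a row belongs to a category, while restructure carries the payment amount into the matching field.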
Using the Derive node
Using the Derive node, you can create six types of new fields from one or more existing fields:
- Formula. The new field is the result of an arbitrary CLEM expression.
- Flag. The new field is a flag, representing a specified condition.
- Nominal. The new field is nominal, meaning that its members are a group of specified values.
- State. The new field is one of two states. Switching between these states is triggered by a specified condition.
- Count. The new field is based on the number of times that a condition has been true.
- Conditional. The new field is the value of one of two expressions, depending on the value of a condition.
Each of these derive types has its own set of options. These options are discussed in subsequent topics.
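The six derive types can be sketched in plain Python to show how they differ. This is an illustration under stated assumptions, not the Derive node's implementation or CLEM syntax: records are dictionaries processed in order, conditions and expressions are Python callables, the State type is modeled with a separate "on" and "off" condition, and the Count type omits any reset behavior. The function names and the sample fields amount and channel are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Condition = Callable[[Record], bool]
Expression = Callable[[Record], Any]


def derive_formula(rows: List[Record], expr: Expression) -> List[Any]:
    """Formula: the result of one expression per record."""
    return [expr(r) for r in rows]


def derive_flag(rows: List[Record], cond: Condition) -> List[bool]:
    """Flag: true or false, depending on a condition."""
    return [bool(cond(r)) for r in rows]


def derive_nominal(rows: List[Record], rules: List[Tuple[Any, Condition]],
                   default: Any = None) -> List[Any]:
    """Nominal: the value of the first rule whose condition holds."""
    out = []
    for r in rows:
        value = default
        for candidate, cond in rules:
            if cond(r):
                value = candidate
                break
        out.append(value)
    return out


def derive_state(rows: List[Record], on_cond: Condition, off_cond: Condition,
                 initial: bool = False) -> List[bool]:
    """State: one of two states, carried from row to row until a switch fires."""
    state = initial
    out = []
    for r in rows:
        if not state and on_cond(r):
            state = True
        elif state and off_cond(r):
            state = False
        out.append(state)
    return out


def derive_count(rows: List[Record], cond: Condition, initial: int = 0) -> List[int]:
    """Count: running number of records for which the condition was true."""
    count = initial
    out = []
    for r in rows:
        if cond(r):
            count += 1
        out.append(count)
    return out


def derive_conditional(rows: List[Record], cond: Condition,
                       then_expr: Expression, else_expr: Expression) -> List[Any]:
    """Conditional: one of two expressions, chosen per record by a condition."""
    return [then_expr(r) if cond(r) else else_expr(r) for r in rows]


# Hypothetical sample records.
rows = [
    {"amount": 60, "channel": "web"},
    {"amount": 120, "channel": "store"},
    {"amount": 30, "channel": "web"},
    {"amount": 5, "channel": "store"},
]

doubled = derive_formula(rows, lambda r: r["amount"] * 2)
is_large = derive_flag(rows, lambda r: r["amount"] > 50)
band = derive_nominal(
    rows,
    [("high", lambda r: r["amount"] >= 100), ("medium", lambda r: r["amount"] >= 50)],
    default="low",
)
alert = derive_state(rows, lambda r: r["amount"] > 100, lambda r: r["amount"] < 10)
web_visits = derive_count(rows, lambda r: r["channel"] == "web")
net = derive_conditional(rows, lambda r: r["channel"] == "web",
                         lambda r: r["amount"] - 5, lambda r: r["amount"])
```

Formula, Flag, Nominal, and Conditional look only at the current record, whereas State and Count carry a value forward from earlier records, so in this sketch their results depend on the order in which records arrive.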
Note that any of the following can change row order:
- Executing in a database via SQL pushback
- Executing via remote Analytic Server
- Using functions that run in embedded Analytic Server
- Deriving a list
- Calling spatial functions
Tip: The Control Language for Expression Manipulation (CLEM) is a powerful tool you can use to analyze and manipulate the data used in your flows. For example, you might use CLEM in a node to derive values. For more information, see the CLEM (legacy) language reference.