Sim Fit node

The Simulation Fitting node fits a set of candidate statistical distributions to each field in the data. The fit of each distribution to a field is assessed using a goodness of fit criterion. When a Simulation Fitting node runs, a Simulation Generate node is built (or an existing node is updated). Each field is assigned its best fitting distribution. The Simulation Generate node can then be used to generate simulated data for each field.

Although the Simulation Fitting node is a terminal node, it does not add output to the Outputs panel, or export data.

Note: If the historical data is sparse (that is, there are many missing values), it may be difficult for the fitting component to find enough valid values to fit distributions to the data. In cases where the data is sparse, before fitting you should either remove the sparse fields if they are not required, or impute the missing values. Using the QUALITY options in the Data Audit node, you can view the number of complete records, identify which fields are sparse, and select an imputation method. If there is an insufficient number of records for distribution fitting, you can use a Balance node to increase the number of records.
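The checks described in the note can be illustrated outside the product with a short Python sketch. This is not Data Audit node code: the record layout (a list of dictionaries with None for a missing value), the function names, and the 50% sparsity threshold are all assumptions made for this example.

```python
from statistics import mean

def audit(records, fields):
    """Return (number of complete records, {field: fraction missing})."""
    complete = sum(all(r.get(f) is not None for f in fields) for r in records)
    missing = {
        f: sum(r.get(f) is None for r in records) / len(records)
        for f in fields
    }
    return complete, missing

def drop_sparse_and_impute(records, fields, max_missing=0.5):
    """Drop fields whose missing fraction exceeds max_missing (a hypothetical
    threshold), then fill the remaining gaps with the field mean."""
    _, missing = audit(records, fields)
    kept = [f for f in fields if missing[f] <= max_missing]
    means = {f: mean(r[f] for r in records if r.get(f) is not None) for f in kept}
    return [
        {f: (r[f] if r.get(f) is not None else means[f]) for f in kept}
        for r in records
    ]

records = [
    {"age": 30, "income": 50.0, "score": None},
    {"age": None, "income": 70.0, "score": None},
    {"age": 50, "income": None, "score": 1.0},
    {"age": 40, "income": 60.0, "score": None},
]
fields = ["age", "income", "score"]
complete, missing = audit(records, fields)
cleaned = drop_sparse_and_impute(records, fields)
```

In this sample, the score field is missing in three of four records, so it is dropped, and the single gaps in age and income are filled with the field mean. Mean imputation is only one possible method; choose the method that suits your data.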

Using a Sim Fit node to automatically create a Sim Gen node

The first time the Simulation Fitting node is run, a Simulation Generate node is generated with an update link to the Simulation Fitting node. If the Simulation Fitting node is run again, a new Simulation Generate node is generated only if the update link has been removed. You can also use a Simulation Fitting node to update a connected Simulation Generate node. The result depends on whether the same fields are present in both nodes, and whether the fields are unlocked in the Simulation Generate node. See Sim Gen node for more information.

A Simulation Fitting node can only have an update link to a Simulation Generate node. To define an update link to a Simulation Generate node, follow these steps:

  1. Right-click the Simulation Fitting node and select Define Update Link.
  2. Click the Simulation Generate node to which you want to define an update link.

To remove an update link between a Simulation Fitting node and a Simulation Generate node, right-click the update link and select Remove Link.

Distribution fitting

A statistical distribution is the theoretical frequency of the occurrence of values that a variable can take. In the Simulation Fitting node, a set of theoretical statistical distributions is compared to each field of data. The parameters of each theoretical distribution are adjusted to give the best fit to the data according to a goodness-of-fit measure: either the Anderson-Darling criterion or the Kolmogorov-Smirnov criterion. The results of the distribution fitting by the Simulation Fitting node show which distributions were fitted, the best estimates of the parameters for each distribution, and how well each distribution fits the data. During distribution fitting, correlations between fields with numeric storage types, and contingencies between fields with a categorical distribution, are also calculated. The results of the distribution fitting are used to create a Simulation Generate node.
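To make the idea of ranking candidate distributions by goodness of fit concrete, the following Python sketch fits three candidates to one numeric field and orders them by the Kolmogorov-Smirnov statistic. It is an illustration only and does not reproduce the node's internal procedure: the choice of candidates, the simple parameter estimates (sample mean, standard deviation, minimum, and maximum), and the function names are assumptions made for this example.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov statistic: the largest gap between the empirical
    CDF of the sample and the fitted theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(xs)
    )

def fit_candidates(sample):
    """Estimate parameters for each candidate distribution and return
    (name, parameters, KS statistic) tuples, best fit first."""
    m, s = mean(sample), stdev(sample)
    lo, hi = min(sample), max(sample)
    candidates = {
        "normal": ({"mean": m, "stdev": s}, NormalDist(m, s).cdf),
        "uniform": ({"min": lo, "max": hi},
                    lambda x: (x - lo) / (hi - lo)),
    }
    if lo >= 0:  # the exponential distribution only supports non-negative values
        candidates["exponential"] = (
            {"rate": 1 / m}, lambda x: 1 - math.exp(-x / m))
    results = [(name, params, ks_statistic(sample, cdf))
               for name, (params, cdf) in candidates.items()]
    return sorted(results, key=lambda r: r[2])

# Synthetic field drawn from a normal distribution, for demonstration only.
random.seed(0)
data = [random.gauss(100, 15) for _ in range(500)]
ranking = fit_candidates(data)
best_name, best_params, best_ks = ranking[0]
```

A smaller Kolmogorov-Smirnov statistic indicates a closer fit, so the first entry in the ranking plays the role of the best-fitting distribution for the field.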

Before any distributions are fitted to your data, the first 1000 records are examined for missing values. If there are too many missing values, distribution fitting is not possible. If so, you must decide whether either of the following options is appropriate:
  • Use an upstream node to remove records with missing values.
  • Use an upstream node to impute values for missing values.
Distribution fitting does not exclude user-missing values. If your data has user-missing values and you want those values to be excluded from distribution fitting, then you should set those values to system missing.
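The effect of leaving user-missing values in place can be sketched in Python. In this example, None stands in for system missing, and the -99 code, the record layout, and the function names are hypothetical. The sketch does not model the node's actual limit on missing values, which is not stated here.

```python
def to_system_missing(records, user_missing):
    """Replace user-missing codes with None (standing in for system missing).
    user_missing maps a field name to the set of codes declared missing."""
    return [
        {f: (None if v in user_missing.get(f, ()) else v) for f, v in r.items()}
        for r in records
    ]

def missing_fraction(records, field, sample_size=1000):
    """Fraction of missing values for a field within the first sample_size records."""
    head = records[:sample_size]
    return sum(r[field] is None for r in head) / len(head)

# -99 is a hypothetical user-missing code for the age field.
records = [{"age": a} for a in (25, -99, 40, -99, 31)]
cleaned = to_system_missing(records, {"age": {-99}})
valid_ages = [r["age"] for r in cleaned if r["age"] is not None]
```

Before the conversion, the -99 codes would be counted as valid ages and would distort any fitted distribution. After the conversion, they are counted as missing and only the three genuine values remain.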

The role of a field is not taken into account when the distributions are fitted. For example, fields with the role Target are treated the same as fields with roles of Input, None, Both, Partition, Split, Frequency, and ID.

Fields are treated differently during distribution fitting according to their storage type and measurement level. The treatment of fields during distribution fitting is described in the following table.

Table 1. Distribution fitting according to storage type and measurement level of fields
Storage type: String
  • Continuous: Impossible.
  • Categorical, Flag, Nominal: Categorical, dice and fixed distributions are fitted.
Storage type: Integer, Real, Time, Date, Timestamp
  • Continuous: All distributions are fitted. Correlations and contingencies are calculated.
  • Categorical, Flag, Nominal: The categorical distribution is fitted. Correlations are not calculated.
  • Ordinal: Binomial, negative binomial and Poisson distributions are fitted, and correlations are calculated.
  • Typeless: Field is ignored and not passed to the Simulation Generate node.
Storage type: Unknown
  • Appropriate storage type is determined from the data.

Fields with the measurement level ordinal are treated like continuous fields and are included in the correlations table in the Simulation Generate node. If you want a distribution other than binomial, negative binomial, or Poisson to be fitted to an ordinal field, you must change the measurement level of the field to continuous. If you have previously defined a label for each value of an ordinal field, and then change the measurement level to continuous, the labels will be lost.

Fields that have a single value are treated the same as fields with multiple values during distribution fitting. Fields with the storage type time, date, or timestamp are treated as numeric.

Fitting distributions to split fields

If your data contains a split field, and you want distribution fitting to be carried out separately for each split, you must transform the data by using an upstream Restructure node. Using the Restructure node, generate a new field for each value of the split field. You can then use this restructured data for distribution fitting in the Simulation Fitting node.
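The restructuring step can be pictured with a small Python sketch. It is not the Restructure node itself: the record layout, the function name, and the naming pattern for the generated fields are assumptions made for this example.

```python
def restructure(records, split_field, value_field):
    """For each distinct value of split_field, add a new field named
    '<split value>_<value_field>' that holds value_field for records in
    that split and None for all other records."""
    split_values = sorted({r[split_field] for r in records})
    out = []
    for r in records:
        new = dict(r)
        for s in split_values:
            new[f"{s}_{value_field}"] = (
                r[value_field] if r[split_field] == s else None
            )
        out.append(new)
    return out

records = [
    {"region": "north", "sales": 10.0},
    {"region": "south", "sales": 7.5},
    {"region": "north", "sales": 12.0},
]
wide = restructure(records, "region", "sales")
north_sales = [r["north_sales"] for r in wide if r["north_sales"] is not None]
```

Each generated field contains only the values for one split, so a distribution fitted to that field describes that split alone.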