SPSS Modeler flows add-on
You can build a machine learning model as a flow by using SPSS Modeler, which provides a convenient way to prepare data, train the model, and evaluate it.
For a compatibility report of data sources supported by SPSS Modeler in Watson Studio Local, see Software Product Compatibility.
Requirements
An SPSS Modeler runtime is created for each user per project. Each Modeler runtime consumes 1 CPU and 5 GB of memory.
Install the SPSS Modeler add-on
Watson Studio Local provides the Community Edition of SPSS Modeler as an add-on; it is limited to 5000 rows of data. The full Enterprise Edition is available as a separate purchase and can be installed over the Community Edition without uninstalling it first (ensure that the current SPSS runtimes are stopped before you install it).
Before you install the add-on, set up passwordless SSH from the first master node to every node in the cluster:
# Generate an SSH key pair on the first master node.
ssh-keygen
# Copy the public key to each node in the cluster (repeat for every node).
ssh-copy-id *node-ip*
# If the key is in a non-standard path, or you want to use a specific SSH key, specify it when you connect:
ssh -i *location-of-ssh-key* *node-ip*
To install the SPSS Modeler Flows add-on into Watson Studio Local, complete the following steps:
- Sign in to the first master node of the Watson Studio Local cluster as root or as another user who is a member of the docker group on all nodes of the cluster.
- On the first master node, create the /root/canvas-install directory and download the SPSS Modeler installation package into it.
- Extract the contents of the SPSS Modeler installation package, for example:
tar zxvf SPSS_MDL_DSE_1.0_LINUX_X86-64_EN.tar.gz
- Go to the canvas_package directory (cd canvas_package) and run install.sh.
Build a model with the SPSS Modeler
- Open a project in the Watson Studio Local client. SPSS Modeler is listed as a new runtime on the Environments tab of your project, where you can start or stop the Modeler and configure its resources.
- If your data is not already part of the project, add data sets to the data assets for the project. Supported types of data files are Var (.csv), Excel (.xlsx), Statistics (.sav), and SAS (.sd2, .ssd, .tpt, .sas7bdat).
- Create the flow:
- From your project, click the Assets tab and click Modeler flows.
- Select to create a machine learning flow to build your model. You can create a new machine learning flow from scratch or import an existing SPSS stream (.str) file.
- Type a name and description for your machine learning flow.
- Select IBM SPSS Modeler.
- Click the Create button. The SPSS Modeler tool opens so that you can build your flow.
- Add the data from your project to the SPSS Modeler. Click the Find and Add Data icon to see a list of the data sets or connections to choose from.
- Open the node palette by clicking the palette icon.
- From the node palette, select a Record Operations, Field Operations, Graphs, Modeling, Outputs, or Export node and drag it onto the SPSS Modeler canvas. See The node palette for descriptions.
- From the SPSS Modeler, double-click a node to specify its properties.
- Draw a connector from the data set to the node.
- Continue to add operators or other nodes as needed to build your model.
Options for building a model
- You can group related nodes together into a supernode, which is represented by a star icon.
- You can run any terminal node within the SPSS Modeler without running the entire model. Right-click the node and select Run.
- To view the results of an Outputs node, such as a Table node, run the node and then click the View outputs and versions icon. In the side palette, on the Outputs tab, click the object, such as a table, to open it.
- To save a version of a machine learning flow, click the View outputs and versions icon. In the side palette, on the Versions tab, save the version.
- The following SPSS-created models support the Save as Model option: Bayes Net, C5.0, C&R Tree, CHAID, Tree-AS, Random Trees, Decision List, GLE, Linear, Linear-AS, LSVM, Logistic, Neural Net, KNN, Cox, SVM, Discriminant, Association Rules, Apriori, Sequence, Kohonen, Anomaly, K-Means, TwoStep, TwoStep-AS.
The node palette
- Import nodes
- Cognos
- The IBM Cognos source node enables you to bring Cognos database data or single list reports into your data mining session. In this way, you can combine the business intelligence features of Cognos with the predictive analytics capabilities of IBM SPSS Modeler. You can import relational, dimensionally-modeled relational (DMR), and OLAP data.
- TM1
- The IBM Cognos TM1 source node enables you to bring Cognos TM1 data into your data mining session. In this way, you can combine the enterprise planning features of Cognos with the predictive analytics capabilities of IBM SPSS Modeler. You can import a flattened version of the multidimensional OLAP cube data.
- Weather Data
- The TWC source node imports weather data from The Weather Company, an IBM Business.
- Record Operations nodes
- User Input
- The User Input node provides an easy way for you to create synthetic data, either from scratch or by altering existing data. This is useful, for example, when you want to create a test data set for modeling.
- Select
- You can use Select nodes to select or discard a subset of records from the data stream based on a specific condition, such as BP (blood pressure) = "HIGH".
- Sample
- The Sample node selects a subset of records. A variety of sample types are supported, including stratified, clustered, and nonrandom (structured) samples. Sampling can be useful to improve performance, and to select groups of related records or transactions for analysis.
- Sort
- You can use Sort nodes to sort records into ascending or descending order based on the values of one or more fields. For example, Sort nodes are frequently used to view and select records with the most common data values. Typically, you first aggregate the data by using the Aggregate node and then use the Sort node to sort the aggregated data into descending order of record counts. Display these results in a table so you can explore the data and make decisions, such as selecting the records of the top 10 best customers. A Python sketch of this aggregate-and-sort pattern is shown after the node palette.
- Balance
- The Balance node corrects imbalances in a data set, so it conforms to a specified condition. The balancing directive adjusts the proportion of records where a condition is true by the factor specified.
- Distinct
- Duplicate records in a data set must be removed before data mining can begin. For example, in a marketing database, individuals may appear multiple times with different address or company information. You can use the Distinct node to find or remove duplicate records in your data, or to create a single, composite record from a group of duplicate records.
- Aggregate
- Aggregation is a data preparation task that is frequently used to reduce the size of a data set. Before proceeding with aggregation, you should take time to clean the data, concentrating especially on missing values. When you aggregate, potentially useful information regarding missing values might be lost.
- Merge
- The Merge node takes multiple input records and creates a single output record that contains some or all of the input fields. It is useful for merging data from different sources, such as internal customer data and purchased demographic data.
- Append
- The Append node allows multiple data sets to be appended together (similar to 'UNION' in SQL). For example, a customer may have sales data in separate files for each month and wants to combine them into a single view of sales over several years.
- Streaming TS
- You use the Streaming Time Series node to build and score time series models in one step. A separate time series model is built for each target field; however, model nuggets are not added to the generated models palette and the model information cannot be browsed.
- SMOTE
- The Synthetic Minority Over-sampling Technique (SMOTE) node provides an over-sampling algorithm to deal with imbalanced data sets. It provides an advanced method for balancing data. The SMOTE process node is implemented in Python and requires the imbalanced-learn© Python library (a standalone sketch of the technique is shown after the node palette).
- Complex Sample
- Complex sample options allow for finer control of the sample, including clustered, stratified, and weighted samples along with other options.
- Field Operations nodes
- Auto Data Prep
- Preparing data for analysis is one of the most important steps in any project—and traditionally, one of the most time consuming. Automated Data Preparation (ADP) handles the task for you, analyzing your data and identifying fixes, screening out fields that are problematic or not likely to be useful, deriving new attributes when appropriate, and improving performance through intelligent screening techniques. You can use the algorithm in fully automatic fashion, allowing it to choose and apply fixes, or you can use it in interactive fashion, previewing the changes before they are made and accepting or rejecting them as you want.
- Type
- The Type node specifies field metadata and properties. For example, you can specify a measurement level (continuous, nominal, ordinal, or flag) for each field, set options for handling missing values and system nulls, set the role of a field for modeling purposes, specify field and value labels, and specify values for a field. In some cases you might need to fully instantiate the Type node in order for other nodes to work correctly, such as the fields from property of the Set to Flag node. You can simply connect a Table node and execute it to instantiate the fields.
- Filter
- Using a Filter node, you can rename or filter fields at any point in the stream.
- Derive
- The Derive node modifies data values or creates new fields from one or more existing fields. It creates fields of type formula, flag, nominal, state, count, and conditional.
- Filler
- Filler nodes are used to replace field values and change storage. You can choose to replace values based on a specified CLEM condition, such as `@BLANK(FIELD)`. Alternatively, you can choose to replace all blanks or null values with a specific value. Filler nodes are often used in conjunction with the Type node to replace missing values. For example, you can fill blanks with the mean value of a field by specifying an expression such as `@GLOBAL_MEAN`. This expression will fill all blanks with the mean value as calculated by a Set Globals node.
- Reclassify
- The Reclassify node enables the transformation from one set of categorical values to another. Reclassification is useful for collapsing categories or regrouping data for analysis. For example, you could reclassify the values for Product into three groups, such as Kitchenware, Bath and Linens, and Appliances. Often, this operation is performed directly from a Distribution node by grouping values and generating a Reclassify node.
- Binning
- The Binning node creates new nominal fields by grouping the values of one or more existing continuous (numeric range) fields into a set of bins, for example, bins of fixed width or bins that each contain an equal number of records.
- Ensemble
- The Ensemble node combines two or more model nuggets to obtain more accurate predictions than can be gained from any of the individual models. By combining predictions from multiple models, limitations in individual models may be avoided, resulting in a higher overall accuracy. Models combined in this manner typically perform at least as well as the best of the individual models and often better.
- Partition
- Partition nodes are used to generate a partition field that splits the data into separate subsets or samples for the training, testing, and validation stages of model building. By using one sample to generate the model and a separate sample to test it, you can get a good indication of how well the model will generalize to larger data sets that are similar to the current data. A scikit-learn sketch of this kind of partitioning is shown after the node palette.
- Set to Flag
- The Set to Flag node is used to derive flag fields based on the categorical values defined for one or more nominal fields. For example, your dataset might contain a nominal field, BP (blood pressure), with the values High, Normal, and Low. For easier data manipulation, you might create a flag field for high blood pressure, which indicates whether or not the patient has high blood pressure. A pandas sketch of deriving such flag fields is shown after the node palette.
- Restructure
- The Restructure node can be used to generate multiple fields based on the values of a nominal or flag field. The newly generated fields can contain values from another field or numeric flags (0 and 1). The functionality of this node is similar to that of the Set to Flag node. However, it offers more flexibility. It allows you to create fields of any type (including numeric flags), using the values from another field. You can then perform aggregation or other manipulations with other nodes downstream. (The Set to Flag node lets you aggregate fields in one step, which may be convenient if you are creating flag fields.)
- Transpose
- By default, columns are fields and rows are records or observations. If necessary, you can use a Transpose node to swap the data in rows and columns so that fields become records and records become fields. For example, if you have time series data where each series is a row rather than a column, you can transpose the data prior to analysis.
- Field Reorder
- Use the Field Reorder node to define the natural order used to display fields downstream. This order affects the display of fields in a variety of places, such as tables, lists, and the Field Chooser. This operation is useful, for example, when working with wide data sets to make fields of interest more visible.
- WEX Feature Extractor
- Use the WEX Feature Extractor node to extract numeric data from unstructured text data. This node is available only if the Watson Explorer (WEX) add-on package is installed. The node extracts features from specified text fields, calculates a numeric value for each feature, and adds the features as new fields. It uses the specified Watson Explorer collection to calculate vector values for each text field with TF-IDF logic. You can also specify categories of features (such as part of speech, sentiment phrase, and more) to be extracted in the target Watson Explorer collection.
- Graphs nodes
- Plot
- Plot nodes show the relationship between numeric fields. You can create a plot using points (also known as a scatterplot), or you can use lines. You can create three types of line plots by specifying an X Mode in the dialog box.
- Multiplot
- A multiplot is a special type of plot that displays multiple Y fields over a single X field. The Y fields are plotted as colored lines and each is equivalent to a Plot node with Style set to Line and X Mode set to Sort. Multiplots are useful when you have time sequence data and want to explore the fluctuation of several variables over time.
- Time Plot
- Time Plot nodes enable you to view one or more time series plotted over time. The series you plot must contain numeric values and are assumed to occur over a range of time in which the periods are uniform.
- Histogram
- Histogram nodes show the occurrence of values for numeric fields. They are often used to explore the data before manipulations and model building. Similar to the Distribution node, Histogram nodes are frequently used to reveal imbalances in the data.
- Distribution
- A distribution graph or table shows the occurrence of symbolic (non-numeric) values, such as mortgage type or gender, in a data set. A typical use of the Distribution node is to show imbalances in the data that can be rectified by using a Balance node before creating a model. You can automatically generate a Balance node using the Generate menu in the distribution graph or table window.
- Collection
- Collections are similar to histograms except that collections show the distribution of values for one numeric field relative to the values of another, rather than the occurrence of values for a single field. A collection is useful for illustrating a variable or field whose values change over time. Using 3-D graphing, you can also include a symbolic axis displaying distributions by category. Two dimensional Collections are shown as stacked bar charts, with overlays where used.
- Web
- Web nodes show the strength of relationships between values of two or more symbolic fields. The graph displays connections using varying types of lines to indicate connection strength. You can use a Web node, for example, to explore the relationship between the purchase of various items at an e-commerce site or a traditional retail outlet.
- Evaluation
- The Evaluation node offers an easy way to evaluate and compare predictive models to choose the best model for your application. Evaluation charts show how models perform in predicting particular outcomes. They work by sorting records based on the predicted value and confidence of the prediction, splitting the records into groups of equal size (quantiles), and then plotting the value of the business criterion for each quantile, from highest to lowest. Multiple models are shown as separate lines in the plot.
- Modeling nodes
- Auto Classifier
- The Auto Classifier node builds several classification models by using multiple algorithms and settings, evaluates them, and selects the best-performing ones. These models can then be used to score new data, and by combining ("ensembling") their results, you can obtain a more accurate prediction.
- Auto Numeric
- The Auto Numeric node is equivalent to the Auto Classifier, but for numeric/continuous targets.
- Bayes Net
- A Bayesian network is a model that displays variables in a data set and the probabilistic, or conditional, independencies between them. Using the Bayes Net node, you can build a probability model by combining observed and recorded evidence with "common-sense" real-world knowledge to establish the likelihood of occurrences by using seemingly unlinked attributes.
- C5.0
- The C5.0 node builds either a decision tree or a rule set. The model works by splitting the sample based on the field that provides the maximum information gain at each level. The target field must be categorical. Multiple splits into more than two subgroups are allowed.
- C&R Tree
- The Classification and Regression (C&R) Tree node generates a decision tree that you can use to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node in the tree is considered “pure” if 100% of cases in the node fall into a specific category of the target field. Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary (only two subgroups).
- CHAID
- The Chi-squared Automatic Interaction Detection (CHAID) tree node generates decision trees by using chi-square statistics to identify optimal splits. CHAID first examines the crosstabulations between each of the input fields and the outcome, and tests for significance using a chi-square independence test. If more than one of these relations is statistically significant, CHAID selects the input field that is the most significant (smallest p value). If an input has more than two categories, these are compared, and categories that show no differences in the outcome are collapsed together.
- Tree-AS
- The Tree-AS node can be used with data in a distributed environment. In this node you can choose to build decision trees using either a CHAID or Exhaustive CHAID model. CHAID, or Chi-squared Automatic Interaction Detection, is a classification method for building decision trees by using chi-square statistics to identify optimal splits.
- Random Trees
- The Random Trees node is similar to the C&RT node; however, the Random Trees node is designed to process big data to create a single tree. The Random Trees tree node generates a decision tree that you use to predict or classify future observations. The method uses recursive partitioning to split the training records into segments by minimizing the impurity at each step, where a node in the tree is considered pure if 100% of cases in the node fall into a specific category of the target field. Target and input fields can be numeric ranges or categorical (nominal, ordinal, or flags); all splits are binary (only two subgroups).
- Random Forest
- The Random Forest node is a tree-based classification and prediction method that is built on Classification and Regression Tree methodology. It generates a decision tree that you use to predict or classify future observations. This prediction method uses recursive partitioning to split the training records into segments with similar output field values.
- Decision List
- The Decision List node identifies subgroups, or segments, that show a higher or lower likelihood of a given binary outcome relative to the overall population. For example, you might look for customers who are unlikely to churn or are most likely to respond favorably to a campaign. You can incorporate your business knowledge into the model by adding your own custom segments and previewing alternative models side by side to compare the results. Decision List models consist of a list of rules in which each rule has a condition and an outcome. Rules are applied in order, and the first rule that matches determines the outcome.
- Time Series
- The Time Series node can be used to estimate and build exponential smoothing, univariate Autoregressive Integrated Moving Average (ARIMA), or multivariate ARIMA (or transfer function) models for time series, and produce forecasts based on the time series data.
- GLE
- The Generalized Linear Engine node uses a variety of statistical techniques to support both classification and continuous predicted values. Unlike many algorithms, the target does not need to have a normal distribution.
- Linear
- Linear regression models predict a continuous target based on linear relationships between the target and one or more predictors.
- Linear-AS
- Linear regression is a common statistical technique for classifying records based on the values of numeric input fields. Linear regression fits a straight line or surface that minimizes the discrepancies between predicted and actual output values. Linear-AS can run when connected to IBM SPSS Analytic Server.
- LSVM
- The Linear Support Vector Machine (LSVM) is a classification algorithm that is particularly suited for use with wide data sets, that is, those with a large number of predictor fields.
- Logistic
- Logistic regression is a statistical technique for classifying records based on values of input fields. It is analogous to linear regression but takes a categorical target field instead of a numeric range.
- Neural Net
- The Neural Net node uses a simplified model of the way the human brain processes information. It works by simulating a large number of interconnected simple processing units that resemble abstract versions of neurons. Neural networks are powerful general function estimators and require minimal statistical or mathematical knowledge to train or apply.
- KNN
- Nearest Neighbor Analysis is a method for classifying cases based on their similarity to other cases. In machine learning, it was developed as a way to recognize patterns of data without requiring an exact match to any stored patterns, or cases. Similar cases are near each other and dissimilar cases are distant from each other. Thus, the distance between two cases is a measure of their dissimilarity. A scikit-learn sketch of nearest-neighbor classification is shown after the node palette.
- Cox
- Cox Regression is used for survival analysis, such as estimating the probability that an event has occurred at a certain time. For example, a company is interested in modeling the time to churn in order to determine the factors that are associated with customers who are quick to switch to another service.
- PCA/Factor
- The Principal Components Analysis node aims to reduce the complexity of data by finding a smaller number of derived fields that effectively summarize the information in the original set of fields.
- SVM
- The SVM node enables you to use a support vector machine to classify data. SVM is particularly suited for use with wide datasets, that is, those with a large number of predictor fields. You can use the default settings on the node to produce a basic model relatively quickly, or you can use the Expert settings to experiment with different types of SVM model.
- Feature Selection
- The Feature Selection node screens input fields for removal based on a set of criteria (such as the percentage of missing values); it then ranks the importance of remaining inputs relative to a specified target. For example, given a data set with hundreds of potential inputs, which are most likely to be useful in modeling patient outcomes?
- Discriminant
- Discriminant analysis builds a predictive model for group membership. The model is composed of a discriminant function (or, for more than two groups, a set of discriminant functions) based on linear combinations of the predictor variables that provide the best discrimination between the groups. The functions are generated from a sample of cases for which group membership is known; the functions can then be applied to new cases that have measurements for the predictor variables but have unknown group membership.
- Association Rules
- Association rules are statements of the form "if condition(s), then prediction(s)". For example, "If a customer purchases a razor and after shave, then that customer will purchase shaving cream with 80% confidence." The Association Rules node extracts a set of rules from the data, pulling out the rules with the highest information content.
- Apriori
- The Apriori node discovers association rules in the data. Association rules are statements of the following form: if antecedent(s), then consequent(s). For example, "if a customer purchases a razor and after shave, then that customer will purchase shaving cream with 80% confidence." Apriori extracts a set of rules from the data, pulling out the rules with the highest information content. Apriori offers five different methods of selecting rules and uses a sophisticated indexing scheme to efficiently process large data sets. A Python sketch of Apriori-style rule mining is shown after the node palette.
- Sequence
- The Sequence node discovers patterns in sequential or time-oriented data, in the format bread -> cheese. The elements of a sequence are item sets that constitute a single transaction. For example, if a person goes to the store and purchases bread and milk and then a few days later returns to the store and purchases some cheese, that person's buying activity can be represented as two item sets. The first item set contains bread and milk, and the second one contains cheese. A sequence is a list of item sets that tend to occur in a predictable order. The Sequence node detects frequent sequences and creates a generated model node that can be used to make predictions.
- Kohonen
- The Kohonen node generates a type of neural network that can be used to cluster the data set into distinct groups. When the network is fully trained, records that are similar should be close together on the output map, while records that are different will be far apart. You can look at the number of observations captured by each unit in the model nugget to identify the strong units. This may give you a sense of the appropriate number of clusters.
- Anomaly
- The Anomaly node identifies outliers, or unusual cases, in the data. Unlike other modeling methods that store rules about unusual cases, anomaly detection models store information on what normal behavior looks like. This makes it possible to identify outliers even if they do not conform to any known pattern, and it can be particularly useful in applications, such as fraud detection.
- K-Means
- K-Means is an unsupervised algorithm used to cluster the dataset into distinct groups. Instead of trying to predict an outcome, k-means tries to uncover patterns in the set of input fields. Records are grouped so that records within a group or cluster tend to be similar to each other, but records in different groups are dissimilar. A scikit-learn sketch of k-means clustering is shown after the node palette.
- TwoStep
- The TwoStep Cluster node provides a form of cluster analysis. It can be used to cluster the dataset into distinct groups when you don't know what those groups are at the beginning. As with Kohonen nodes and K-Means nodes, TwoStep Cluster models do not use a target field. Instead of trying to predict an outcome, TwoStep Cluster tries to uncover patterns in the set of input fields. Records are grouped so that records within a group or cluster tend to be similar to each other, but records in different groups are dissimilar.
- TwoStep-AS
- TwoStep Cluster is an exploratory tool that is designed to reveal natural groupings (or clusters) within a data set that would otherwise not be apparent.
- Isotonic-AS
- The Isotonic-AS node in SPSS® Modeler is implemented in Spark.
- K-Means-AS
- K-Means is one of the most commonly used clustering algorithms. It clusters data points into a predefined number of clusters. The K-Means-AS node in SPSS® Modeler is implemented in Spark. For details about K-Means algorithms, see K-Means-AS.
- XGBoost-AS
- XGBoost© is an advanced implementation of a gradient boosting algorithm. Boosting algorithms iteratively learn weak classifiers and then add them to a final strong classifier. XGBoost is very flexible and provides many parameters that can be overwhelming to most users, so the XGBoost-AS node in SPSS® Modeler exposes the core features and commonly used parameters. The XGBoost-AS node is implemented in Spark.
- XGBoost Tree
- XGBoost Tree© is an advanced implementation of a gradient boosting algorithm with a tree model as the base model. Boosting algorithms iteratively learn weak classifiers and then add them to a final strong classifier. XGBoost Tree is very flexible and provides many parameters that can be overwhelming to most users, so the XGBoost Tree node in SPSS® Modeler exposes the core features and commonly used parameters. The node is implemented in Python (a standalone sketch that uses the open-source xgboost package is shown after the node palette).
- XGBoost Linear
- XGBoost Linear© is an advanced implementation of a gradient boosting algorithm with a linear model as the base model. Boosting algorithms iteratively learn weak classifiers and then add them to a final strong classifier. The XGBoost Linear node in SPSS® Modeler is implemented in Python.
- One-Class SVM
- The One-Class SVM© node uses an unsupervised learning algorithm. The node can be used for novelty detection. It detects the soft boundary of a given set of samples, and then classifies new points as belonging to that set or not. This One-Class SVM modeling node is implemented in Python and requires the scikit-learn© Python library (a standalone sketch is shown after the node palette).
- GenLin
- The generalized linear model expands the general linear model so that the dependent variable is linearly related to the factors and covariates via a specified link function. The model allows for the dependent variable to have a non-normal distribution.
- Regression
- Linear regression is a common statistical technique for classifying records based on the values of numeric input fields.
- Quest
- The Quest node provides a binary classification method for building decision trees, designed to reduce the processing time required for large C&R Tree analyses while also reducing the tendency found in classification tree methods to favor inputs that allow more splits.
- Outputs nodes
- Table
- The Table node displays the data in table format, which can also be written to a file. This is useful anytime that you need to inspect your data values or export them in an easily readable form.
- Matrix
- The Matrix node enables you to create a table that shows relationships between fields. It is most commonly used to show the relationship between two categorical fields (flag, nominal, or ordinal), but it can also be used to show relationships between continuous (numeric range) fields.
- Analysis
- The Analysis node allows you to evaluate the ability of a model to generate accurate predictions. Analysis nodes perform various comparisons between predicted values and actual values (your target field) for one or more model nuggets. Analysis nodes can also be used to compare predictive models to other predictive models.
- Data Audit
- The Data Audit node provides a comprehensive first look at the data, including summary statistics, histograms and distribution for each field, as well as information about outliers, missing values, and extremes. Results are displayed in an easy-to-read matrix that can be sorted and used to generate full-size graphs and data preparation nodes.
- Transform
- Normalizing input fields is an important step before using traditional scoring techniques, such as regression, logistic regression, and discriminant analysis. These techniques carry assumptions about normal distributions of data that may not be true for many raw data files. One approach to dealing with real-world data is to apply transformations that move a raw data element toward a more normal distribution. In addition, normalized fields can easily be compared with each other—for example, income and age are on totally different scales in a raw data file but when normalized, the relative impact of each can be easily interpreted. A Python sketch of this kind of rescaling is shown after the node palette.
- Statistics
- The Statistics node gives you basic summary information about numeric fields. You can get summary statistics for individual fields and correlations between fields.
- Means
- The Means node compares the means between independent groups or between pairs of related fields to test whether a significant difference exists. For example, you can compare mean revenues before and after running a promotion or compare revenues from customers who didn't receive the promotion with those who did.
- Export nodes
- Data Asset Export
- You can use the Data Asset Export node to write to remote data sources by using connections.
- Flat File
- The Flat File export node enables you to write data to a delimited text file. This is useful for exporting data that can be read by other analysis or spreadsheet software.
- Excel
- The Excel export node outputs data in Microsoft Excel .xlsx format. Optionally, you can choose to automatically launch Excel and open the exported file when the node is executed.
- SAS File Export
- This node enables you to write data in SAS format to be read into SAS or a SAS-compatible software package. You can export in three SAS file formats: SAS for Windows/OS2, SAS for UNIX, or SAS.
- Statistics File Export
- Use the Statistics File Export node to export data in IBM® SPSS® Statistics .sav format. IBM SPSS Statistics .sav files can be read by IBM SPSS Statistics Base and other modules.
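The following sketches illustrate, in open-source Python, the techniques behind several of the nodes described above. They are conceptual examples only: they are not part of SPSS Modeler, they do not show how the corresponding nodes are configured, and the data sets, column names, and library calls (pandas, scikit-learn, imbalanced-learn, mlxtend, xgboost) are assumptions made for illustration.

Aggregate and Sort: a minimal pandas sketch of the pattern described for the Sort node, aggregating sales per customer and then sorting in descending order of the totals.

import pandas as pd

sales = pd.DataFrame({
    "customer": ["A", "B", "A", "C", "B", "A"],
    "amount":   [120,  80, 200,  50, 300,  90],
})

# Aggregate total sales per customer, then sort in descending order of the total.
totals = (sales.groupby("customer", as_index=False)["amount"].sum()
               .sort_values("amount", ascending=False))

print(totals.head(10))   # for example, the top 10 customers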
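SMOTE: a minimal sketch of the over-sampling technique behind the SMOTE node, using the open-source imbalanced-learn and scikit-learn libraries; the generated data set is purely illustrative.

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Create a deliberately imbalanced two-class data set.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Before over-sampling:", Counter(y))

# SMOTE synthesizes new minority-class records by interpolating between
# a minority record and its nearest minority-class neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After over-sampling:", Counter(y_res))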
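Partition: a sketch of splitting records into training and testing samples with scikit-learn; the file name customers.csv and the partition labels are hypothetical.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")   # hypothetical data set

# Split the records into training (70%) and testing (30%) samples.
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)

# Record the split as a partition field, similar in spirit to the field
# that the Partition node generates.
df["Partition"] = "1_Training"
df.loc[test_df.index, "Partition"] = "2_Testing"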
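Set to Flag: a pandas sketch of deriving one flag field per category of a nominal field, reusing the BP example from the node description.

import pandas as pd

df = pd.DataFrame({"BP": ["High", "Normal", "Low", "High"]})

# Derive one flag field per category of the nominal field BP.
flags = pd.get_dummies(df["BP"], prefix="BP")      # BP_High, BP_Low, BP_Normal
df = pd.concat([df, flags], axis=1)
print(df)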
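KNN: a scikit-learn sketch of nearest-neighbor classification; the built-in iris data set stands in for your own data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classify each test case by the majority class of its 5 nearest training cases.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Accuracy:", knn.score(X_test, y_test))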
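Apriori and Association Rules: a sketch of Apriori-style rule mining with the open-source mlxtend library; the transactions and thresholds are made up for illustration.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["razor", "after shave", "shaving cream"],
    ["razor", "after shave", "shaving cream"],
    ["razor", "after shave"],
    ["bread", "milk"],
]

# One-hot encode the transactions, find frequent item sets, then extract rules.
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "confidence"]])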
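K-Means: a scikit-learn sketch of k-means clustering on standardized inputs; the random data stands in for your input fields.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # stand-in for the input fields

# Standardize the inputs, then group the records into 4 clusters.
X_std = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_std)

print(kmeans.labels_[:10])                    # cluster membership per record
print(kmeans.cluster_centers_.shape)          # one center per cluster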
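XGBoost Tree: a sketch of gradient boosting with tree base learners using the open-source xgboost package; the synthetic data and parameter values are illustrative only.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient boosting with tree base learners; a few commonly tuned parameters.
model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))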
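One-Class SVM: a scikit-learn sketch of novelty detection, learning the boundary of a set of "normal" samples and scoring new points against it; the data is synthetic.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal_samples = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Learn the soft boundary of the "normal" samples; nu bounds the fraction
# of training points treated as outliers.
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal_samples)

new_points = np.array([[0.1, -0.2],    # looks like the training data
                       [4.0,  4.0]])   # a likely novelty
print(detector.predict(new_points))    # +1 = inside the boundary, -1 = novelty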
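Transform: a sketch of the kind of rescaling discussed for the Transform node, using a log transform and standardization from scikit-learn; the age and income values are made up.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Income is right-skewed and on a very different scale from age.
age    = np.array([25, 40, 33, 58], dtype=float)
income = np.array([52000, 61000, 48000, 920000], dtype=float)

income_log = np.log(income)                       # a log transform pulls in the long tail
fields = np.column_stack([age, income_log])
scaled = StandardScaler().fit_transform(fields)   # columns now have mean 0 and are comparable
print(scaled)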
CLEM functions of the Expression Builder
The following CLEM functions are available for working with data in IBM SPSS Modeler. You can enter these functions as code in various dialog boxes, such as Derive and Set To Flag nodes, or you can use the Expression Builder to create valid CLEM expressions.
Function Type | Description |
---|---|
Information | Used to gain insight into field values. For example, the function is_string returns true for all records whose type is a string. |
Conversion | Used to construct new fields or convert storage type. For example, the function to_timestamp converts the selected field to a timestamp. |
Comparison | Used to compare field values to each other or to a specified string. For example, <= is used to compare whether the values of two fields are lesser or equal. |
Logical | Used to perform logical operations, such as if, then, else operations. |
Numeric | Used to perform numeric calculations, such as the natural log of field values. |
Trigonometric | Used to perform trigonometric calculations, such as the arccosine of a specified angle. |
Probability | Returns probabilities that are based on various distributions, such as probability that a value from Student's t distribution is less than a specific value. |
Spatial | Used to perform spatial calculations on geospatial data. |
Bitwise | Used to manipulate integers as bit patterns. |
Random | Used to randomly select items or generate numbers. |
String | Used to perform various operations on strings, such as stripchar, which allows you to remove a specified character. |
SoundEx | Used to find strings when the precise spelling is not known; based on phonetic assumptions about how certain letters are pronounced. |
Date and time | Used to perform various operations on date, time, and timestamp fields. |
Sequence | Used to gain insight into the record sequence of a data set or perform operations that are based on that sequence. |
Global | Used to access global values that are created by a Set Globals node. For example, @MEAN is used to refer to the mean average of all values for a field across the entire data set. |
Blanks and null | Used to access, flag, and frequently fill user-specified blanks or system-missing values. For example, @BLANK(FIELD) is used to raise a true flag for records where blanks are present. |
Special fields | Used to denote the specific fields under examination. For example, @FIELD is used when deriving multiple fields. |
Data visualizations
To view advanced data visualizations like Spreadsheet, Data Audit, and Chart, click View data on a data node.
In the Chart view, you can highlight data in the spreadsheet and generate a new Select Node.
Create connections
You can create database connections to use as source or target nodes in a modeler flow.
Load data from a remote data set
To automatically load a remote data set:
- Create a new SPSS Modeler flow and click the Find and Add Data icon in the toolbar. Then, click the Connection tab. Find the remote data set that you want, and click it.
- Select the remote data set and drag it onto the canvas.
- You can use the Table name field in the Database import node dialog box to access the database and read data from the selected table.
Connecting to data sources over SSL
If you plan to use SSL for a Db2 for Linux, UNIX, and Windows connection or a Big SQL connection that uses a self-signed certificate or a certificate that is signed by a local certificate authority (CA), you need to import the SSL certificate into the SPSS Modeler keystore:
- Export the database server's self-signed certificate or local CA certificate to a file, or contact your database administrator for this certificate file.
- Copy the certificate file to the cluster master node, for example, /tmp/ibmcert.pem or /tmp/mylabel.cert.
- Import the certificate into the SPSS Modeler keystore by using the /wdp/utils/importCertSpss.sh script, which is provided on the master node of the cluster. The following command runs the script:
/wdp/utils/importCertSpss.sh /tmp/ibmcert.pem
where /tmp/ibmcert.pem represents the fully qualified path to the file with the certificate that you exported in the previous step.
Save the model from a stream
When you run a flow, a new model nugget is created. You can right-click the model nugget to save the model. The model is saved as a PMML model and is displayed under Models in the project.
Schedule execution of SPSS Modeler stream as a job
You can run a flow from your project automatically by creating and scheduling a job. From your project, click the Jobs tab and create a new job for stream execution by using create job. Select SPSS Stream run as the Type, select SPSS Worker as the Worker, and choose the source asset (the stream). You can run the job immediately by using Run now or schedule it to run at a specific time.
Import an SPSS stream file when SPSS Modeler is not installed
You can still import a .str file even when the SPSS Modeler add-on is not installed.
- Open a project in the Watson Studio Local client.
- From your project, click the Assets tab and click SPSS Modeler Flows.
- Click Import flow, and then import your own SPSS Modeler stream (.str) file by clicking the Browse button.
- Type a name and description for your machine learning flow.
- Click the Create button.
Run an SPSS worker job on Watson Machine Learning
After you finish working with flows on the development side, commit the project and create a new release tag.
- Go to Watson Machine Learning to create a project release: select the project and select the release tag that you created.
- Click the Assets tab and select Flows in the assets list on the left side to filter on flows.
- Click job in the pane on the right side, which brings you to the create job deployment page.
- Select SPSS Stream run as the Type and SPSS Worker as the Worker. After you create a job, it appears in the deployments list. To run the job, you must first bring the release online by clicking the go live button in the top right of the project release details page. After the release is live, the newly created job is displayed on the Deployment page. Click the job name to run the job. When the job is running, status messages are displayed. After the job finishes, a success or failed status is displayed.