Data Preparation
IBM® SPSS® Data Preparation performs advanced techniques to streamline the data preparation stage, delivering faster, more accurate data analysis results.
This module is included in the SPSS Professional edition for on-premises and in the Base edition for subscription plans.
The "validate data" dialog is where you set up validation checks for your data. The variables tab lists the variables in your file. Start by selecting the variables you want to check and moving them to the "analysis variables" list.
You can specify basic checks to apply to variables and cases in your file. For example, you can obtain reports that identify variables with a high percentage of missing values or empty cases.
Apply rules to individual variables to identify invalid values, that is, values outside a valid range or missing values. You can apply predefined rules, create your own single-variable rules or define cross-variable rules.
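The basic checks and single-variable rules described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the SPSS procedure or its syntax; the sample data, the function names and the 0 to 120 age range are all hypothetical.

```python
# Hypothetical sample data: None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 212, "income": None},
    {"age": None, "income": None},
]


def percent_missing(rows, variable):
    """Percentage of cases (0-100) in which `variable` is missing."""
    missing = sum(1 for row in rows if row.get(variable) is None)
    return 100.0 * missing / len(rows)


def empty_cases(rows, variables):
    """Indices of cases in which every listed variable is missing."""
    return [
        i for i, row in enumerate(rows)
        if all(row.get(v) is None for v in variables)
    ]


def range_violations(rows, variable, low, high):
    """Indices of cases whose non-missing value lies outside [low, high]."""
    return [
        i for i, row in enumerate(rows)
        if row.get(variable) is not None and not (low <= row[variable] <= high)
    ]


if __name__ == "__main__":
    print(percent_missing(rows, "age"))            # share of missing ages
    print(empty_cases(rows, ["age", "income"]))    # fully empty cases
    print(range_violations(rows, "age", 0, 120))   # assumed valid age range
```

Flagged cases are only reported here; whether to remove or correct them remains a decision for the analyst.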
Automated data preparation delivers recommendations and enables users to drill in and examine the recommendations.
Manual data preparation is a complex and time-consuming process. When you need results quickly, the ADP procedure helps you detect and correct quality errors and impute missing values in one efficient step. The ADP feature provides an easy-to-understand report with comprehensive recommendations and visualizations to help you determine the right data to use in your analysis.
Perform automatic data checks and help eliminate time-consuming, tedious, manual checks by using the validate data procedure. This procedure enables you to apply rules to perform data checks based on each variable’s measurement level, whether categorical or continuous. Then, determine data validity and remove or correct suspicious cases at your discretion before analysis.
IBM SPSS Data Preparation includes data validation, automated data preparation, optimal binning and identification of unusual cases.
With the optimal binning procedure, you can more accurately use algorithms designed for nominal attributes, such as Naive Bayes and logit models. Optimal binning enables you to bin—or set cut points for—scale variables.
Choose one of these types of optimal binning for preprocessing data before model building:
1) Unsupervised: Create bins with equal counts.
2) Supervised: Take the target variable into account to determine cut points. This method is more accurate than unsupervised. However, it is also more computationally intensive.
3) Hybrid approach: Combine the unsupervised and supervised approaches. This method is useful if you have a large number of distinct values.
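To make the difference between the first two approaches concrete, the following Python sketch contrasts equal-count cut points with a single target-aware cut point. It is a simplified illustration under stated assumptions, not the algorithm SPSS uses: the function names are hypothetical, and the supervised part picks just one cut by minimizing weighted class entropy, which is one common way to take a target into account.

```python
import math
from collections import Counter


def equal_count_cut_points(values, n_bins):
    """Unsupervised: cut points that place about the same number of cases in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    if not 1 <= n_bins <= n:
        raise ValueError("n_bins must be between 1 and the number of values")
    cuts = []
    for k in range(1, n_bins):
        idx = k * n // n_bins
        # Midpoint between the last value of one bin and the first of the next.
        # Heavily tied data can produce duplicate cut points.
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_supervised_cut(values, targets):
    """Supervised (assumed criterion): the single cut that minimizes weighted entropy."""
    pairs = sorted(zip(values, targets), key=lambda p: p[0])
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point exists between equal values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_cut, best_score = (pairs[i - 1][0] + pairs[i][0]) / 2, score
    return best_cut


if __name__ == "__main__":
    print(equal_count_cut_points([1, 2, 3, 4, 5, 6, 7, 8], 4))
    print(best_supervised_cut([1, 2, 3, 10, 11, 12],
                              ["no", "no", "no", "yes", "yes", "yes"]))
```

The supervised search evaluates every candidate boundary against the target, which is why it costs more than the equal-count approach. A hybrid could first reduce the data to equal-count bins and then search only those boundaries.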