spss.Dataset Class (Python)

spss.Dataset(name,hidden,cvtDates). Provides the ability to create new datasets, read from existing datasets, and modify existing datasets. A Dataset object provides access to the case data and variable information contained in a dataset, and allows you to read from the dataset, add new cases, modify existing cases, add new variables, and modify properties of existing variables.

An instance of the Dataset class can only be created within a data step or StartProcedure-EndProcedure block, and cannot be used outside of the data step or procedure block in which it was created. Data steps are initiated with the spss.StartDataStep function. You can also use the spss.DataStep class to implicitly start and end a data step without the need to check for pending transformations. See the topic spss.DataStep Class (Python) for more information.

The argument name is optional and specifies the name of an open dataset for which a Dataset object will be created. Note that this is the name as assigned by IBM® SPSS® Statistics or as specified with DATASET NAME. Specifying name="*" or omitting the argument creates a Dataset object for the active dataset. If the active dataset is unnamed when a Dataset object is created for it, a name is automatically generated for it.
If the Python data type None or the empty string '' is specified for name, then a new empty dataset is created. The name of the dataset is automatically generated and can be retrieved from the name property of the resulting Dataset object. The name cannot be changed from within the data step. To change the name, use the DATASET NAME command following spss.EndDataStep.
A new dataset created with the Dataset class is not set to be the active dataset. To make the dataset the active one, use the spss.SetActive function.
The optional argument hidden specifies whether the Data Editor window associated with the dataset is hidden. By default, it is displayed. Use hidden=True to hide the associated Data Editor window.
The optional argument cvtDates specifies whether IBM SPSS Statistics variables with date or datetime formats are converted to Python datetime.datetime objects when reading data from IBM SPSS Statistics. The argument is a boolean: True to convert all variables with date or datetime formats, False otherwise. If cvtDates is omitted, then no conversions are performed.
Note: Values of variables with date or datetime formats that are not converted with cvtDates are returned as integers representing the number of seconds from October 14, 1582.
Instances of the Dataset class created within StartProcedure-EndProcedure blocks cannot be set as the active dataset.
The Dataset class does not honor case filters specified with the FILTER or USE commands. If you need case filters to be honored, then consider using the Cursor class.
For release 22 Fix Pack 1 and higher, the Dataset class supports caching. Caching typically improves performance when cases are modified in a random manner, and is specified with the cache property of a Dataset object.
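The note above on unconverted date values can be made concrete with a short conversion. The following is a minimal sketch in plain Python that does not call the spss module. The helper names spss_seconds_to_datetime and datetime_to_spss_seconds are hypothetical, and the sketch assumes the count starts at midnight on October 14, 1582.

```python
import datetime

# Assumed zero point for raw date values: midnight, October 14, 1582.
SPSS_EPOCH = datetime.datetime(1582, 10, 14)

def spss_seconds_to_datetime(seconds):
    """Convert a raw date value (seconds from the zero point) to a datetime."""
    return SPSS_EPOCH + datetime.timedelta(seconds=seconds)

def datetime_to_spss_seconds(value):
    """Inverse conversion: a datetime back to seconds from the zero point."""
    return (value - SPSS_EPOCH).total_seconds()

# 86400 seconds (one day) after the zero point is October 15, 1582.
print(spss_seconds_to_datetime(86400))
```

A helper like this may be useful when cvtDates is omitted and only a few values need to be interpreted as dates.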

The number of variables in the dataset associated with a Dataset instance is available using the len function, as in:

len(datasetObj)

Note: Datasets that are not required outside of the data step or procedure in which they were accessed or created should be closed prior to ending the data step or procedure in order to free the resources allocated to the dataset. This is accomplished by calling the close method of the Dataset object.
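One way to make sure that close is called even if an error occurs partway through is Python's contextlib.closing, which calls the close method of the object it wraps when the with block exits. The sketch below uses a hypothetical stand-in class, StandInDataset, so that it runs outside IBM SPSS Statistics. Applying the same pattern to a real Dataset object is an assumption, not something this topic prescribes.

```python
from contextlib import closing

class StandInDataset(object):
    """Hypothetical stand-in for a Dataset object; it only mimics close()."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

ds = StandInDataset()
with closing(ds):
    # Work with the dataset here.
    pass

# closing() has called ds.close() by this point.
print(ds.closed)
```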

Example: Creating a New Dataset

BEGIN PROGRAM.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset(name=None)
datasetObj.varlist.append('numvar',0)
datasetObj.varlist.append('strvar',1)
datasetObj.varlist['numvar'].label = 'Sample numeric variable'
datasetObj.varlist['strvar'].label = 'Sample string variable'
datasetObj.cases.append([1,'a'])
datasetObj.cases.append([2,'b'])
spss.EndDataStep()
END PROGRAM.

You add variables to a dataset using the append (or insert) method of the VariableList object associated with the dataset. The VariableList object is accessed from the varlist property of the Dataset object, as in datasetObj.varlist. See the topic VariableList Class (Python) for more information.
Variable properties, such as the variable label and measurement level, are set through properties of the associated Variable object, accessible from the VariableList object. For example, datasetObj.varlist['numvar'] accesses the Variable object associated with the variable numvar. See the topic Variable Class (Python) for more information.
You add cases to a dataset using the append (or insert) method of the CaseList object associated with the dataset. The CaseList object is accessed from the cases property of the Dataset object, as in datasetObj.cases. See the topic CaseList Class (Python) for more information.

Example: Saving New Datasets

When creating new datasets that you intend to save, you'll want to keep track of the dataset names since the save operation is done outside of the associated data step.

DATA LIST FREE /dept (F2) empid (F4) salary (F6).
BEGIN DATA
7  57  57000 
5  23  40200
3  62  21450
3  18  21900
5  21  45000
5  29  32100
7  38  36000
3  42  21900
7  11  27900
END DATA.
DATASET NAME saldata.
SORT CASES BY dept.
BEGIN PROGRAM.
import spss
with spss.DataStep():
   ds = spss.Dataset()
   # Create a new dataset for each value of the variable 'dept'
   newds = spss.Dataset(name=None)
   newds.varlist.append('dept')
   newds.varlist.append('empid')
   newds.varlist.append('salary')
   dept = ds.cases[0,0][0]
   dsNames = {newds.name:dept} 
   for row in ds.cases:
      if (row[0] != dept):
         newds = spss.Dataset(name=None)
         newds.varlist.append('dept')
         newds.varlist.append('empid')
         newds.varlist.append('salary')
         dept = row[0]
         dsNames[newds.name] = dept
      newds.cases.append(row) 
# Save the new datasets
for name,dept in dsNames.items():
   strdept = str(dept)
   spss.Submit(r"""
   DATASET ACTIVATE %(name)s.
   SAVE OUTFILE='/mydata/saldata_%(strdept)s.sav'.
   """ %locals())
spss.Submit(r"""
DATASET ACTIVATE saldata.
DATASET CLOSE ALL.
""" %locals())
END PROGRAM.

The code newds = spss.Dataset(name=None) creates a new dataset. The name of the dataset is available from the name property, as in newds.name. In this example, the names of the new datasets are stored to the Python dictionary dsNames.
To save new datasets created with the Dataset class, use the SAVE command after the data step has ended, that is, after calling spss.EndDataStep or, as in this example, after the with spss.DataStep() block exits. In this example, DATASET ACTIVATE is used to activate each new dataset, using the dataset names stored in dsNames.
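To see what the save loop produces, the string substitution can be tried outside IBM SPSS Statistics. The following sketch only builds the command text and does not submit it. The function name build_save_syntax, the out_dir parameter, and the dataset names dsA, dsB, and dsC are hypothetical; real names are generated automatically and are read from the name property.

```python
def build_save_syntax(ds_names, out_dir='/mydata'):
    """Return one block of command syntax per dataset name."""
    blocks = []
    # Sort by dataset name so the output order is predictable.
    for name, dept in sorted(ds_names.items()):
        strdept = str(dept)
        blocks.append(
            "DATASET ACTIVATE %(name)s.\n"
            "SAVE OUTFILE='%(out_dir)s/saldata_%(strdept)s.sav'."
            % {'name': name, 'out_dir': out_dir, 'strdept': strdept}
        )
    return blocks

# Hypothetical dataset names mapped to illustrative dept values.
example_names = {'dsA': 3, 'dsB': 5, 'dsC': 7}
for block in build_save_syntax(example_names):
    print(block)
```

In the example above, each block of text would be passed to spss.Submit after the data step has ended.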

Example: Modifying Case Values

DATA LIST FREE /cust (F2) amt (F5).
BEGIN DATA
210 4500
242 6900
370 32500
END DATA.
BEGIN PROGRAM.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
for i in range(len(datasetObj.cases)):
   # Multiply the value of amt by 1.05 for each case
   datasetObj.cases[i,1] = 1.05*datasetObj.cases[i,1][0]
spss.EndDataStep()
END PROGRAM.

The CaseList object, accessed from the cases property of a Dataset object, allows you to read or modify case data. To access the value for a given variable within a particular case, you specify the case number and the index of the variable. Case numbers start from 0, and index values represent position in the dataset, starting with 0 for the first variable in file order. For example, datasetObj.cases[i,1] specifies the value of the variable with index 1 for case number i.
When reading case values, results are returned as a list. In the present example we're accessing a single value within each case so the list has one element.

See the topic CaseList Class (Python) for more information.
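The indexing pattern in this example can be tried outside IBM SPSS Statistics with a small stand-in. The class StandInCaseList below is hypothetical and is not the real CaseList class. It mimics only the two behaviors described above, under the assumption that nothing else matters for this loop: reading a single value returns a one-element list, and assigning to a case number and variable index replaces that value.

```python
class StandInCaseList(object):
    """Hypothetical stand-in; mimics only [case, variable] indexing."""
    def __init__(self, rows):
        self._rows = [list(row) for row in rows]

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, key):
        case, var = key
        # A single value is returned wrapped in a list.
        return [self._rows[case][var]]

    def __setitem__(self, key, value):
        case, var = key
        self._rows[case][var] = value

cases = StandInCaseList([[210, 4500], [242, 6900], [370, 32500]])
for i in range(len(cases)):
    # cases[i, 1] is a one-element list, so take element 0 before multiplying.
    cases[i, 1] = 1.05 * cases[i, 1][0]
    print(cases[i, 1])
```

Omitting the trailing [0] would multiply a list by a float, which raises a TypeError in Python.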

Example: Comparing Datasets

Dataset objects allow you to concurrently work with the case data from multiple datasets. As a simple example, we'll compare the cases in two datasets and indicate identical cases with a new variable added to one of the datasets.

DATA LIST FREE /id (F2) salary (DOLLAR8) jobcat (F1).
BEGIN DATA
1 57000 3
3 40200 1
2 21450 1
END DATA.
SORT CASES BY id.
DATASET NAME empdata1.
DATA LIST FREE /id (F2) salary (DOLLAR8) jobcat (F1).
BEGIN DATA
3 41000 1
1 59280 3
2 21450 1
END DATA.
SORT CASES BY id.
DATASET NAME empdata2.
BEGIN PROGRAM.
import spss
spss.StartDataStep()
datasetObj1 = spss.Dataset(name="empdata1")
datasetObj2 = spss.Dataset(name="empdata2")
nvars = len(datasetObj1)
datasetObj2.varlist.append('match')
for i in range(len(datasetObj1.cases)):
   if datasetObj1.cases[i] == datasetObj2.cases[i,0:nvars]:
      datasetObj2.cases[i,nvars] = 1
   else:
      datasetObj2.cases[i,nvars] = 0
spss.EndDataStep()
END PROGRAM.

The two datasets are first sorted by the variable id, which is common to both datasets, so that corresponding cases appear in the same position in each dataset.
Since DATA LIST creates unnamed datasets (the same is true for GET), the datasets are named using DATASET NAME so that you can refer to them when calling spss.Dataset.
datasetObj1 and datasetObj2 are Dataset objects associated with the two datasets empdata1 and empdata2 to be compared.
The code datasetObj1.cases[i] returns case number i from empdata1. The code datasetObj2.cases[i,0:nvars] returns the slice of case number i from empdata2 that includes the variables with indexes 0,1,...,nvars-1.
The new variable match, added to empdata2, is set to 1 for cases that are identical and 0 otherwise.
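The comparison logic can be checked outside IBM SPSS Statistics using plain Python lists in place of the two datasets. In the sketch below, the function name flag_matches is hypothetical, and the lists hold the example data as it would appear after sorting by id. Unlike the example above, zip stops at the shorter of the two lists, so the sketch assumes both have the same number of cases.

```python
def flag_matches(cases1, cases2, nvars):
    """Return 1 for each position where the first nvars values agree, else 0."""
    flags = []
    for row1, row2 in zip(cases1, cases2):
        # Compare the whole first case with a slice of the second case.
        flags.append(1 if list(row1) == list(row2[0:nvars]) else 0)
    return flags

# Example data after SORT CASES BY id.
empdata1 = [[1, 57000, 3], [2, 21450, 1], [3, 40200, 1]]
empdata2 = [[1, 59280, 3], [2, 21450, 1], [3, 41000, 1]]

# Only the case with id 2 is identical in both lists.
print(flag_matches(empdata1, empdata2, 3))
```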