Classifying Cell Samples (SVM)
Support Vector Machine (SVM) is a classification and regression technique that is particularly well suited to wide datasets, that is, datasets with a large number of predictors, such as those encountered in bioinformatics (the application of information technology to biochemical and biological data).
A medical researcher has obtained a dataset containing characteristics of a number of human cell samples extracted from patients who were believed to be at risk of developing cancer. Analysis of the original data showed that many of the characteristics differed significantly between benign and malignant samples. The researcher wants to develop an SVM model that can use the values of these cell characteristics in samples from other patients to give an early indication of whether their samples might be benign or malignant.
This example uses the stream named svm_cancer.str, available in the Demos folder under the streams subfolder. The data file is cell_samples.data. See the topic Demos Folder for more information.
The example is based on a dataset that is publicly available from the UCI Machine Learning Repository. The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:
Field name | Description |
---|---|
ID | Patient identifier |
Clump | Clump thickness |
UnifSize | Uniformity of cell size |
UnifShape | Uniformity of cell shape |
MargAdh | Marginal adhesion |
SingEpiSize | Single epithelial cell size |
BareNuc | Bare nuclei |
BlandChrom | Bland chromatin |
NormNucl | Normal nucleoli |
Mit | Mitoses |
Class | Benign or malignant |
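For the purposes of this example, we're using a dataset that has a relatively small number of predictors in each record: the nine cell characteristics from Clump through Mit. ID identifies the patient, and Class is the field the model is meant to predict.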
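To make the record layout concrete, here is a minimal Python sketch that reads records with the fields listed in the table and separates the identifier, the predictors, and the class. It is only an illustration, not part of the demo stream. Several things in it are assumptions: that the data is comma-separated with a header row of field names, that the characteristics are stored as integers, that a `?` marks a missing value, and that Class is stored as text. The `CellSample` type, the `parse_samples` helper, and the two sample rows are hypothetical and are not taken from cell_samples.data.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List, Optional

# Field names as listed in the table of fields.
FIELDS = ["ID", "Clump", "UnifSize", "UnifShape", "MargAdh", "SingEpiSize",
          "BareNuc", "BlandChrom", "NormNucl", "Mit", "Class"]

# Everything except the identifier and the class is treated as a predictor.
PREDICTORS = FIELDS[1:-1]


@dataclass
class CellSample:
    """One record: patient identifier, predictor values, and class (hypothetical type)."""
    id: str
    predictors: Dict[str, Optional[int]]
    label: str


def parse_samples(text: str) -> List[CellSample]:
    """Parse comma-separated text with a header row (an assumed file layout)."""
    samples = []
    for row in csv.DictReader(io.StringIO(text)):
        predictors = {}
        for name in PREDICTORS:
            value = row[name].strip()
            # Assumption: non-numeric entries such as "?" mean "missing".
            predictors[name] = int(value) if value.isdigit() else None
        samples.append(CellSample(row["ID"].strip(), predictors, row["Class"].strip()))
    return samples


# Made-up rows for illustration only; they are NOT from cell_samples.data.
EXAMPLE_TEXT = "\n".join([
    ",".join(FIELDS),
    "A001,5,1,1,1,2,1,3,1,1,benign",
    "A002,8,10,10,8,7,?,9,7,1,malignant",
])

samples = parse_samples(EXAMPLE_TEXT)
for sample in samples:
    missing = [name for name, value in sample.predictors.items() if value is None]
    print(sample.id, sample.label, "missing:", missing)
```

Keeping the predictors in a dictionary keyed by field name makes it easy to see which characteristics a model would receive as inputs and to spot records with missing values before any modeling is attempted.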