Once a model has been chosen, model training follows. The training stage involves running the model on training data specific to a computer vision task, measuring performance against ground truth and optimizing parameters to improve performance over time.
CNNs consist of three types of layers: a convolutional layer, a pooling layer and a fully connected layer. The convolutional layer is where feature extraction happens. Feature extraction entails determining and capturing key visual attributes from raw image data, such as colors, edges, shapes and textures. In the case of X-ray images with pneumonia, features to be extracted include asymmetric lung contours, bright regions that indicate inflammation or the presence of fluid (as opposed to dark, air-filled regions), clouded or opaque lung areas, and coarse or patchy textures.4 Feature extraction allows algorithms to distinguish significant relationships and patterns in visual data.
An X-ray image is treated as a matrix of pixel values. Another matrix of weights (parameters that control how much influence a given input feature has on the model’s output), known as a filter or kernel, is applied to a region of the X-ray image, and a dot product is calculated between the filter weights and the input pixel values. The filter moves, or “convolves,” across the image to extract features; the entire process is known as a convolution. The final output from the series of dot products is called an activation map or feature map. Each filter is tuned to respond to specific patterns, such as edges, shapes or textures, allowing the CNN to learn multiple visual features simultaneously.
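The convolution described above can be sketched in a few lines of NumPy. This is a minimal illustration with a toy 5×5 "image" and a hypothetical vertical-edge filter, not a production implementation; real CNN libraries add stride, padding and many filters per layer.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding), computing a
    dot product at each position to produce a feature map."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product of patch and filter
    return out

# Toy image with a sharp dark-to-bright transition between columns 1 and 2
image = np.array([
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

# A filter tuned to respond to vertical edges
edge_filter = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

feature_map = convolve2d(image, edge_filter)
print(feature_map)  # activations are largest where the edge sits
```

The feature map lights up (value 3.0) where the filter straddles the dark-to-bright boundary and stays flat (0.0) over uniform regions, which is exactly the "tuned to respond to specific patterns" behavior described above.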
The feature map is then fed into a pooling layer, which reduces its spatial dimensions. Another filter sweeps across the entire input, taking the maximum or average value within each group of cells in the feature map. This downsampling retains the most essential features, allowing the model to focus on them.
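Max pooling, the most common variant of the operation above, can be sketched as follows. The 4×4 input values here are illustrative, not taken from a real X-ray.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """2x2 max pooling: keep the largest activation in each window,
    halving each spatial dimension while retaining the strongest responses."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()  # keep only the strongest activation
    return out

fmap = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 3, 2],
    [2, 0, 1, 4],
], dtype=float)

print(max_pool(fmap))
# [[4. 5.]
#  [2. 4.]]
```

Each 2×2 window collapses to its single largest value, so a 4×4 map becomes 2×2: a quarter of the cells, but the dominant activations survive.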
A complete pass of an image through these layers, extracting features, reducing dimensions and producing a classification, is known as a forward pass. After the forward pass, the model applies a loss function to calculate its error: the difference between its predicted classification and the true classification.
To minimize the loss function, backpropagation is employed: a backward pass that computes the gradient of the loss function with respect to each weight. The gradient descent technique then uses these gradients to update the model's weights, reducing the loss and optimizing the model.
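A single gradient descent update can be sketched in one line: each weight moves a small step against its gradient, which locally decreases the loss. The weight and gradient values below are toy placeholders, standing in for what backpropagation would actually compute.

```python
def gradient_descent_step(weights, gradients, learning_rate=0.01):
    """One update: nudge each weight opposite its gradient, scaled by the
    learning rate, so the loss decreases locally."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

weights = [0.5, -0.3]
gradients = [2.0, -1.0]  # toy gradients from backpropagation
weights = gradient_descent_step(weights, gradients, learning_rate=0.1)
print(weights)  # approximately [0.3, -0.2]
```

Repeating this step over many batches of training images is what the training stage described earlier amounts to: each forward pass, loss calculation and backward pass yields gradients, and gradient descent turns them into incrementally better weights.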
Finally, the fully connected layer performs classification based on the features extracted by the previous layers and their different filters. The CNN then generates its outputs: probabilities for each class (in this case, normal vs. pneumonia). For the chest X-ray image classification task, this output will indicate either a normal scan or, if the predicted likelihood of pneumonia passes a predetermined threshold, a scan positive for pneumonia.
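The final classification step can be sketched as a fully connected layer followed by a softmax and a threshold. The feature vector, weights and threshold below are illustrative assumptions, not values from any trained model.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    exps = [math.exp(z - max(logits)) for z in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, biases, threshold=0.5):
    """Fully connected layer: each class score is a weighted sum of the
    flattened, pooled features plus a bias; softmax turns scores into
    probabilities, and a threshold yields the final label."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    p_normal, p_pneumonia = softmax(logits)
    label = "pneumonia" if p_pneumonia >= threshold else "normal"
    return label, p_pneumonia

# Toy flattened feature vector and hand-picked weights for two classes
features = [0.8, 0.1, 0.6]
weights = [[0.2, 0.9, -0.4],   # weights for the "normal" class
           [0.7, -0.3, 0.8]]   # weights for the "pneumonia" class
biases = [0.0, 0.0]

label, prob = classify(features, weights, biases)
print(label, round(prob, 3))  # pneumonia 0.731
```

Raising the threshold above 0.5 trades sensitivity for specificity, which is why the article describes it as "predetermined": in a clinical setting it would be chosen to balance missed pneumonia cases against false alarms.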